SECTION 1
OVERVIEW 


1.1 INTRODUCTION

The  objective  of  weapon-target  assignment  (WTA)  in  a  ballistic  missile  defense  (BMD) 
system  is  to  determine  how  defensive  weapons  should  be  assigned  to  boosters  and  re-entry 
vehicles  in  order  to  maximize  the  survival  of  assets  belonging  to  the  U.S.  and  allied  countries. 
The implied optimization problem requires consideration of a large number of potential
weapon-target assignments in order to select the most effective combination of assignments. The
resulting WTA optimization problems are among the most complex encountered in
mathematical programming [1,2]. Indeed, simple versions of the WTA problem have been
shown to be NP-complete [3,4], implying that the computation required to find optimal
solutions is likely to grow exponentially with the numbers of weapons and targets considered
in the solution.

The  computational  complexity  of  the  WTA  problem  has  motivated  the  development  of 
heuristic  algorithms  that  are  not  altogether  satisfactory  for  use  in  Strategic  Defense  Systems 
(SDS).  Some  special  cases  of  the  WTA  problem  are  not  NP-complete  and  can  be  solved  using 
standard  optimization  algorithms  such  as  linear  programming  [5]  and  maximum  marginal  return 
algorithms  [6,7];  these  algorithms  enjoy  low  computational  requirements  and  therefore  have 
been  adopted  as  heuristics  for  solving  more  general  WTA  problems.  However,  experimental 
studies  [2,8,9,10]  have  demonstrated  that  these  heuristic  algorithms  lead  to  significantly 
suboptimal  solutions  for  certain  scenarios. 

In order to address this deficiency, the Strategic Defense Initiative Office initiated
several research efforts to develop efficient, near-optimal boost-phase and midcourse-phase
WTA algorithms for directed energy weapons [11] and kinetic energy weapons
[1,2,9,11,12,13]. These programs developed advanced optimization-based WTA algorithms
which achieved improved performance over the existing SDS WTA algorithms, but which
required increased computation in order to be implemented as part of a real-time system.

Among the most successful WTA algorithms developed were the ILINE algorithm [8,9] and its
subsequent  extensions  [2,10]  for  assignment  of  kinetic  kill  interceptors.  The  merits  of  the 
ILINE  algorithm  for  Boost  and  Post-Boost  WTA  were  established  in  the  Air  Force's  Space- 
Based  Experimental  Version  program  [10]  sponsored  by  ESD;  in  this  program,  various 
candidate  WTA  algorithms  were  studied,  and  ILINE  was  selected  and  implemented  as  the 
superior algorithm for performance of the WTA function. The ILINE algorithm was made
available  to  the  SDI  Battle  Management  community,  and  was  evaluated  in  both  the  Air  Force's 
[10]  and  the  Army's  [14]  BM/C3  Experimental  Version  programs  for  weapon-target 
assignment. 

The  major  limitation  of  the  ILINE  algorithm  for  SDS  WTA  is  the  computation  time 
required  for  selecting  near-optimal  weapon-target  assignments  in  scenarios  with  large  numbers 
of interceptors and targets. For Boost-Phase WTA, the ILINE algorithm may have to solve
WTA problems with 800-1000 targets in the order of 1-3 seconds in order to fit within a
reasonable fraction of the overall real-time planning cycle. For Midcourse WTA, the ILINE
algorithm may be embedded into a dynamic Battle Planning algorithm which requires 10-100
iterative applications of the ILINE algorithm (the extra iterations are required for adaptive
preferential defense and predictive battle planning, as discussed in [2]). Each of these
iterations requires the application of ILINE for WTA problems with up to 10,000 targets; the
overall  dynamic  Battle  Planning  algorithm  computations  must  be  completed  within  2-10 
seconds  in  order  to  fit  within  a  reasonable  fraction  of  the  overall  real-time  midcourse  planning 
cycle. 

As a point of reference, the results of [2] indicate that a single application of the
ILINE algorithm to problems involving up to 4500 targets requires about 300 seconds of CPU
time on a 0.5 MIPS sequential processor, and that the computation time grows near-linearly
with the number of targets. This extrapolates to over 600 seconds of computation time for
each application of the ILINE algorithm for 10,000 targets. Since the midcourse Battle
Planning algorithm applies ILINE 10-100 times within a 2-10 second deadline, we need to
achieve up to four orders of magnitude reduction in computation time from the sequential
computation time on a 0.5 MIPS processor.
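
The arithmetic behind this estimate can be sketched as follows. This is a back-of-the-envelope check rather than part of the original study: it assumes strictly linear scaling from the 4500-target benchmark, and it combines the 10-100 iterations and the 2-10 second midcourse deadline quoted above. The function and variable names are illustrative only.

```python
def required_speedup(targets, iterations, deadline_s,
                     ref_targets=4500, ref_seconds=300.0):
    """Speedup needed relative to the 0.5 MIPS sequential benchmark.

    Assumes run time scales linearly with the number of targets; the text
    reports only near-linear growth, so this is an approximation.
    """
    seconds_per_call = ref_seconds * targets / ref_targets
    return seconds_per_call * iterations / deadline_s


# Least and most demanding midcourse cases quoted in the text.
low = required_speedup(10_000, iterations=10, deadline_s=10.0)
high = required_speedup(10_000, iterations=100, deadline_s=2.0)
print(f"required speedup: about {low:.0f}x to {high:.0f}x")
```

Under these assumptions the required reduction ranges from roughly 670 to roughly 33,000, that is, between about three and somewhat more than four orders of magnitude.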

These lofty goals appear beyond the scope of single-processor technology in the near
future. However, the structure of the ILINE algorithm suggested that significant reductions in
computation time could be achieved through parallel processing, so that a combination of
processor technology improvements and parallel processing could be used to achieve the
desired  real-time  computation  goals.  The  purpose  of  the  phase  one  research  was  to 
demonstrate  the  potential  reductions  in  the  computation  time  of  the  ILINE  algorithm  which  can 
be  achieved  on  different  multiprocessor  architectures  by  developing  and  benchmarking 
different  parallel  variations  of  the  ILINE  algorithm  on  commercial  multiprocessors.  The 
resulting  parallel  WTA  algorithms  provide  the  basis  for  real-time  WTA  algorithm  development 
using  multiprocessor  architectures;  furthermore,  the  benchmarking  results  can  be  used  to 
identify  characteristics  of  desirable  computer  architectures  for  efficient  execution  of  WTA 
algorithms. 

1.2  OVERVIEW  OF  PHASE  1  RESULTS 

The basis of the ILINE WTA algorithm is to solve a sequence of linear assignment
problems [1 using Bertsekas' AUCTION [16,17] algorithm (as extended by Bertsekas and
Castanon [18]). Each application of ILINE requires the solution of 4-6 assignment problems
using AUCTION. Depending on the size of the problem, over 95% of the overall ILINE
computation time is spent in the AUCTION algorithm. Thus, the key to developing parallel
versions of the ILINE algorithm is to develop parallel versions of the underlying AUCTION
algorithm. The AUCTION algorithm is a recently developed optimal algorithm for the solution
of classical assignment problems (finding an optimal one-to-one match from n persons to n
objects in order to maximize the sum of the individual benefits associated with each
person-object match). Assignment problems are important in many aspects of SDS besides
weapon-target assignment; these additional applications include single-sensor, multiple-frame
association for multiobject tracking and multi-sensor correlation.

The AUCTION algorithm has been shown to be a very effective sequential assignment
algorithm, substantially outperforming its rivals for sparse problems. The algorithm operates
like an auction, whereby at each iteration, unassigned persons bid simultaneously for objects,
thereby raising their prices. Objects are then awarded to the highest bidder. The AUCTION
algorithm was also designed with an orientation towards parallel implementation, making it an
ideal starting point for our investigations.
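
The bidding mechanism described above can be illustrated with a minimal sequential sketch for a dense n-by-n problem. This is a textbook-style simplification, not the implementation benchmarked in this report: it processes one bidder at a time, uses a single fixed bidding increment `eps` (an assumed parameter name), and omits sparse data structures and any scaling of the increment. With integer benefits and `eps` smaller than 1/n, the standard auction theory guarantees an optimal assignment.

```python
def auction_assignment(benefit, eps):
    """Sequential auction for an n-by-n assignment problem.

    benefit[i][j] is the benefit of matching person i with object j.
    Returns (object_of, prices), where object_of[i] is person i's object.
    """
    n = len(benefit)
    prices = [0.0] * n
    person_of = [None] * n      # object -> person currently holding it
    object_of = [None] * n      # person -> object currently held
    unassigned = list(range(n))

    while unassigned:
        i = unassigned.pop()
        # Net value of each object to person i at the current prices.
        values = [benefit[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=values.__getitem__)
        second = max((values[j] for j in range(n) if j != best),
                     default=values[best])
        # Raise the price of the preferred object by the bidding increment.
        prices[best] += values[best] - second + eps
        # The previous holder, if any, is outbid and must bid again.
        previous = person_of[best]
        if previous is not None:
            object_of[previous] = None
            unassigned.append(previous)
        person_of[best] = i
        object_of[i] = best
    return object_of, prices


benefit = [[10, 5, 8], [7, 9, 6], [8, 7, 4]]
assignment, prices = auction_assignment(benefit, eps=0.25)
total = sum(benefit[i][j] for i, j in enumerate(assignment))
print(assignment, total)
```

For this small example the best total benefit over all six possible matchings is 25, which is what the sketch returns.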

Through analysis of the structure of the AUCTION algorithm, we identified two
different levels where parallel processing could be used to speed up the computations: a
medium-grained level and a fine-grained level. The medium-grained level (referred to as the
Jacobi level, due to its similarity to the iterative Jacobi algorithm for recursive solution of linear
equations) consisted of processing multiple weapon-target pairs in parallel, while
the fine-grained level (referred to as the Gauss-Seidel level) consisted of processing multiple
targets for a single weapon simultaneously. Ideally, an effective parallel algorithm would
combine the potential speedups achievable at each level in a multiplicative fashion.
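
The two levels can be made concrete with a small sketch of one synchronous Jacobi iteration. It is written serially in Python, so it shows only where the parallelism lies, not how it was realized on the machines studied here. The function names and the dense data layout are assumptions made for illustration.

```python
def jacobi_round(benefit, prices, object_of, person_of, eps):
    """One synchronous Jacobi iteration; returns the number of objects awarded."""
    n = len(benefit)
    best_bid = {}  # object -> (bid price, bidding person)

    # Bidding phase: every unassigned person bids against the same prices,
    # so the iterations of this loop are independent (Jacobi level).
    for i in range(n):
        if object_of[i] is not None:
            continue
        # The scan over objects for a single bidder is the fine-grained
        # work that the Gauss-Seidel level would split across processors.
        values = [benefit[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=values.__getitem__)
        second = max((values[j] for j in range(n) if j != best),
                     default=values[best])
        bid = prices[best] + values[best] - second + eps
        if best not in best_bid or bid > best_bid[best][0]:
            best_bid[best] = (bid, i)

    # Award phase: each object that received bids goes to its highest bidder.
    for j, (bid, i) in best_bid.items():
        if person_of[j] is not None:
            object_of[person_of[j]] = None
        prices[j], person_of[j], object_of[i] = bid, i, j
    return len(best_bid)


def jacobi_auction(benefit, eps):
    n = len(benefit)
    prices, object_of, person_of = [0.0] * n, [None] * n, [None] * n
    rounds = 0
    while None in object_of:
        jacobi_round(benefit, prices, object_of, person_of, eps)
        rounds += 1
    return object_of, rounds


benefit = [[10, 5, 8], [7, 9, 6], [8, 7, 4]]
assignment, rounds = jacobi_auction(benefit, eps=0.25)
total = sum(benefit[i][j] for i, j in enumerate(assignment))
print(assignment, rounds, total)
```

In this toy run only the first round has more than one bidder, which hints at why the parallelism available at the Jacobi level can be limited once most persons hold an object.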

In order to explore the potential for parallel implementation on different multiprocessor
architectures, we developed and implemented the following versions of the AUCTION
algorithm on different multiprocessor architectures at the Advanced Computing Research
Facility (ACRF) at Argonne National Laboratory¹:

1. Two versions of Jacobi AUCTION on the Encore Multimax using sparse data
structures (one synchronous, one asynchronous)

2. Gauss-Seidel AUCTION on the Encore Multimax using sparse data structures

3. Three versions of Hybrid AUCTION on the Encore Multimax using sparse data
structures (one synchronous, two asynchronous)

4. Gauss-Seidel AUCTION on the Alliant FX/8 using sparse data structures

5. Gauss-Seidel AUCTION on the Alliant FX/8 using dense data structures

6. Gauss-Seidel AUCTION on the Connection Machine CM-2 using dense data
structures

7. Gauss-Seidel AUCTION on the DAP 510 using dense data structures

¹ Access to the ACRF was arranged through SDIO sponsorship by Capt. S. Johnson of SDIO.

Figure 1-1 illustrates the speedups obtained from medium-grained parallelization
(Jacobi parallelization) for several 800 and 1000 target assignment problems with different
feasible intercept densities (average percentage of total targets which can be attacked by each
weapon), as established by the shared-memory implementation of the Jacobi AUCTION
algorithm on the Encore Multimax. As the results of Fig. 1-1 indicate, the maximum speedup
achievable by Jacobi parallelization is approximately 4, independent of the density of feasible
intercepts (and also nearly independent of the number of targets available!). This important
limitation is due to the incremental nature of the AUCTION algorithm (described in greater


Figure 1-1. Speedup of parallel Jacobi AUCTION algorithm over the single-processor
algorithm on the Encore Multimax as a function of the density of feasible interceptor
assignments for problems with 800 and 1000 targets.

density of feasible interceptor assignments (unlike the speedups obtained from Jacobi
parallelization). Specifically, the speedups increase with the average number of targets which
can be attacked by each interceptor. Thus, these speedups will increase as the number of
targets grows as well as with increased feasible intercept density for a fixed number of targets.

Figure  1-2  also  illustrates  an  interesting  tradeoff  between  the  use  of  sophisticated  data 
structures  and  the  speed  of  computation.  For  problems  where  the  density  of  feasible 
interceptor-target  assignments  is  less  than  one  (i.e.  each  interceptor  can  only  reach  a  fraction  of 
the  available  targets),  sparse  data  structures  can  be  used  to  keep  track  only  of  the  feasible 
intercepts for each weapon. However, such a representation hinders efficient computation in
SIMD architectures with limited communications, because of the required data movements
among processors. The most efficient SIMD implementations are those which avoid
communications; however, this often requires alignment of the feasible intercept data,
precluding the use of sparse data structures. On advanced MIMD processors such as the
Encore Multimax and the Alliant FX/8, interprocessor communications are less costly, so that
efficient implementations of the AUCTION algorithm using sparse data structures are possible.
Note in particular the differences in performance of the Alliant Gauss-Seidel AUCTION
algorithms  using  sparse  and  dense  data  structures. 
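
The tradeoff can be illustrated with a small sketch contrasting the two representations for the step in which a weapon scans for its best object. The layouts and names below are assumptions for illustration and are not the data structures used in the benchmarked codes. The sparse form stores only feasible (object, benefit) pairs per weapon, so the scan touches fewer entries but follows irregular indices. The dense form pads infeasible pairs with negative infinity, so every row has the same aligned layout at the cost of scanning all objects.

```python
NEG_INF = float("-inf")


def best_object_sparse(feasible, prices, i):
    """feasible[i] lists (object, benefit) pairs that weapon i can reach."""
    j, _ = max(feasible[i], key=lambda pair: pair[1] - prices[pair[0]])
    return j


def best_object_dense(benefit, prices, i):
    """benefit[i][j] is NEG_INF when weapon i cannot reach object j."""
    row = benefit[i]
    return max(range(len(row)), key=lambda j: row[j] - prices[j])


def to_dense(feasible, n_objects):
    """Expand the sparse lists into a full matrix padded with NEG_INF."""
    dense = [[NEG_INF] * n_objects for _ in feasible]
    for i, pairs in enumerate(feasible):
        for j, b in pairs:
            dense[i][j] = b
    return dense


feasible = [[(0, 5.0), (2, 7.0)], [(1, 4.0)]]   # 2 weapons, 3 objects
prices = [0.0, 1.0, 3.0]
dense = to_dense(feasible, n_objects=3)
for i in range(len(feasible)):
    # Both representations select the same object for each weapon.
    print(i, best_object_sparse(feasible, prices, i),
          best_object_dense(dense, prices, i))
```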

An  important  result  which  was  established  in  the  research  was  the  potential  for 
combination  of  the  Jacobi  and  Gauss-Seidel  speedups.  The  hybrid  AUCTION  algorithm  on 
the Encore Multimax was one such implementation, using two processors at a Jacobi level and
a variable number of processors at the Gauss-Seidel level. Note, however, that the speedup of
the hybrid AUCTION algorithm in Fig. 1-2 is far less than a multiplicative combination of the
Gauss-Seidel and Jacobi AUCTION algorithm speedups. The principal limitation in this
combination is the time required for synchronization of the various processors. Figure 1-3
illustrates the growth in the synchronization time of the hybrid AUCTION algorithm as the
number of processors is increased.

Figure 1-3. Performance of the synchronous Hybrid AUCTION algorithm as a
function of the number of processors for 1000 target, 20% dense
assignment problem.


In  order  to  reduce  the  overall  synchronization  time  of  the  hybrid  AUCTION  algorithm, 
we  designed  a  new  asynchronous  version  of  the  hybrid  AUCTION  algorithm  and  proved  its 
convergence  to  a  correct  solution.  We  also  implemented  this  asynchronous  hybrid  AUCTION 
algorithm and verified that significant performance improvements were possible over the
synchronous  hybrid  AUCTION  algorithm.  Figure  1-4  illustrates  the  performance  of  the 
synchronous  and  asynchronous  Hybrid  AUCTION  algorithms  on  the  Encore  Multimax  for 
several  1000  target  problems.  As  the  results  indicate,  the  asynchronous  algorithms  permit  a 
more  efficient  utilization  of  large  numbers  of  processors,  by  reducing  the  synchronization 
overhead, leading to significant reductions (nearly 50%) in computation time.

The results of Figs. 1-1, 1-2, 1-3 and 1-4 illustrate the extent to which the research
goals of phase 1 have been met. In essence, our results establish that significant speedups are
possible for WTA algorithms using multiprocessor architectures; based on the expected size of
the scenarios, proper choice of multiprocessor architecture and parallel algorithm
implementation ought to reduce the overall WTA computation requirements to fit as part of
real-time Battle Management processing software.

Figure 1-4. Performance of synchronous Hybrid AUCTION and asynchronous
Hybrid AUCTION I algorithms on 1000 target assignment problems
of varying density.

The  results  of  this  research  suggest  that  a  superior  architecture  for  assignment  problems 
using  the  AUCTION  algorithm  must  be  able  to  exploit  both  Jacobi  and  Gauss-Seidel 
parallelism. Exploitation of Gauss-Seidel parallelism is best done by SIMD processors capable
of simultaneous associative processing for vectors of significant length (such as the DAP 510).
Exploitation of Jacobi parallelism is best done by MIMD processors with flexible
communications structure, capable of fast interprocessor communication. Our prototype
algorithm benchmarks indicate that architectures which successfully combine these features
should reduce the computation requirements of the AUCTION algorithm by two orders of
magnitude when compared to a Von Neumann architecture for problems involving 1000
targets. For larger problems involving 10,000 targets, the potential speedups from
Gauss-Seidel parallelization should increase by an order of magnitude, leading to reductions of the
computation requirements of the AUCTION algorithm by nearly three orders of magnitude.
These reductions approach the real-time computation requirements (four orders of magnitude
reduction relative to the 0.5 MIPS sequential processing time) discussed earlier; coupled with
advances in individual processor technology, the parallel algorithms (implemented in appropriate
multiprocessor architectures) can be projected to meet the required real-time deadlines.

1.3  REVIEW  OF  RELATED  PARALLEL  ALGORITHM  WORK 

Development  of  parallel  WTA  algorithms  has  been  recognized  as  a  difficult  problem;  in 
essence,  the  nature  of  the  WTA  problem  requires  that  a  global  search  among  many  alternatives 
be  conducted  in  order  to  obtain  a  set  of  near-optimal  assignments.  The  global  nature  of  this 
processing  makes  efficient  distribution  among  multiple  processors  a  difficult  task.  Indeed, 
several  early  efforts  at  developing  parallel  WTA  algorithms  (based  on  the  AUCTION 
algorithm)  conducted  at  Los  Alamos  National  Laboratory  [19]  and  Argonne  National 
Laboratory  [20]  obtained  very  limited  speedups  using  shared-memory  MIMD  architectures.  A 
similar  study  conducted  at  the  Jet  Propulsion  Laboratory  (JPL)  of  the  California  Institute  of 
Technology  [21]  using  a  heuristic  WTA  algorithm  implemented  on  a  message-passing  MIMD 
multiprocessor  achieved  no  significant  speedup. 

In  addition  to  SDS-sponsored  efforts  on  parallel  algorithms,  there  has  been  a  set  of 
recent  research  results  on  the  development  of  parallel  algorithms  for  assignment  problems. 
Kempa,  Kennington  and  Zaki  [22]  have  reported  on  the  parallel  performance  of  the  AUCTION 
algorithm  on  dense  assignment  problems  when  implemented  on  the  Alliant  FX/8.  The 
particular  variation  of  the  AUCTION  algorithm  which  they  implemented  addressed  only  fully 
dense  assignment  problems,  and  did  not  include  sparse  data  structures  or  address  the  issue  of 
algorithms  for  different  multiprocessor  architectures.  In  their  implementation  of  the  Jacobi 
AUCTION algorithm on the Alliant FX/8, they used a synchronous hybrid algorithm which
uses the vector processing capability of each of the Alliant's processors to scan the admissible
objects for each bid, and uses multiple processors to process several bids in parallel. This
hybrid algorithm is similar in spirit to the recommended approach of combining the SIMD and
MIMD speedups. However, their hybrid algorithm only achieved a speedup of near 8 for 1000
person assignment problems when compared with the single processor version of the same
algorithm because of the short length of the vector processors on the Alliant FX/8.
Furthermore, they did not compare their parallel algorithm results with an efficient sequential
algorithm implementation, so they may have overestimated the true speedups achieved on the
Alliant FX/8.

Recently, Balas, Miller, Pekny and Toth [23] have developed a synchronous parallel
assignment  algorithm  based  on  a  successive  shortest  path  algorithm  (rather  than  the  AUCTION 
algorithm)  and  have  implemented  it  successfully  on  a  14-processor  Butterfly  Plus  computer. 
Their  algorithm  is  the  extension  of  Jacobi  parallelization  for  successive  shortest  path  methods, 
since  it  handles  the  assignment  of  multiple  weapons  in  parallel.  However,  the  synchronization 
required  in  the  algorithm  limits  the  effective  speedups  of  the  parallel  shortest  path  algorithm  to 
under two for problems with 1000 persons. Unlike the AUCTION algorithm theory described
subsequently,  a  theory  of  asynchronous  assignment  algorithms  based  on  successive  shortest 
paths  is  not  available  at  this  time. 

Kennington  and  Wang  [24]  have  also  reported  on  parallel  implementation  of  a 
successive shortest path algorithm (the JV algorithm) for dense assignment problems on the
8-processor Sequent Symmetry S81. In their implementation, multiple processors are used to
construct shortest paths from a single unassigned person. This is the extension of the
Gauss-Seidel parallelization for successive shortest path methods. For problems with 1000 persons,
Kennington  and  Wang  obtained  a  speedup  factor  of  3.7  using  8  processors  on  the  Sequent 
Symmetry. 

For SIMD architectures, Zenios and Phillips [25] have experimented with variations of
the Jacobi AUCTION algorithm on the Connection Machine CM-2. By spreading the
information corresponding to potential individual assignments over large numbers of
processors, they are able to implement a SIMD variation of our Hybrid AUCTION algorithm.
However, the performance of their implementation has been disappointingly slow (even though
it was implemented in the C-Paris assembly language); for problems involving 1000 persons,
their computation time on the CM-2 achieves a speedup factor of under 3 when compared with
the sequential computation time of our Gauss-Seidel AUCTION algorithm on a single
processor of the Encore Multimax!

The  results  presented  in  this  report  extend  and  unify  a  number  of  the  above  studies 
using the AUCTION algorithm. By studying carefully the structure of the AUCTION
algorithm, we have identified superior designs for parallel algorithms which can be tailored to
each multiprocessor architecture. Our comparative study of different implementations of the
Gauss-Seidel AUCTION on different multiprocessor architectures provides interesting insights
into the specific advantages and disadvantages of each multiprocessor architecture, rather than
reflecting the specifics of any one implementation on a single architecture. Indeed, our results
suggest  that  many  of  the  speedups  obtained  in  previous  results  can  be  attributed  to  poor 
implementation  of  the  sequential  algorithms.  In  contrast,  we  have  used  the  most  efficient 
variations  of  the  sequential  AUCTION  algorithms  for  our  benchmarks;  these  variations  were 
developed  in  cooperation  with  Prof.  D.  Bertsekas  of  MIT,  the  originator  of  the  AUCTION 
algorithm. 

Furthermore, the theory and benchmarking results developed for the asynchronous
variation of the Hybrid AUCTION algorithm provide the basis for the design of asynchronous
AUCTION algorithms which will operate efficiently with greatly reduced communications and
synchronization. These asynchronous algorithms should be suitable for implementation in
distributed memory MIMD architectures or in more advanced hybrid architectures which
combine desirable features of SIMD and MIMD architectures.

1.4 IDEAS FOR FOLLOW-ON RESEARCH

The  results  obtained  under  this  phase  I  research  study  provide  ample  evidence  that, 
with  a  proper  combination  of  parallel  WTA  algorithm  and  multiprocessor  architecture, 
development  of  real-time  Battle  Planning  software  which  incorporates  advanced  WTA 
algorithm  technology  is  a  feasible  goal  for  realistic  problem  sizes.  However,  the  phase  I 
research  has  focused  only  on  parallel  implementation  of  the  core  WTA  algorithm  (ILINE);  in 
order  to  develop  real-time  Battle  Planning  software,  this  core  WTA  must  be  integrated 
successfully  with  parallel  algorithms  for  other  Battle  Planning  functions  (such  as  computation 
of feasible intercepts) or within recursive Battle Planning algorithms such as the adaptive
preferential defense algorithms or the anticipative algorithms discussed in [2].

One  potential  direction  for  continuation  of  this  research  into  Phase  II  would  be  to 
extend  the  Phase  I  results  and  develop  an  integrated  parallel  Battle  Planning  algorithm  on  an 
advanced multiprocessor architecture which incorporates the various Battle Planning functions
which interact with WTA. This Battle Planning algorithm could be focused either on Boost
and Post-Boost defense (as in [10]) or on Midcourse and Terminal defense (as in [2], [14]).
The choice of problem area will depend on the criticality of parallel processing technology for
achieving real-time performance in this problem; the Boost and Post-Boost problem may have
more modest computation requirements because of its shorter time scale and smaller number of
targets than the corresponding Midcourse and Terminal problems, but the real-time
computation cycle may be shorter. The goal of such a Phase II effort would be to produce a
prototype Battle Planning algorithm design (based on advanced WTA algorithm technology)
and associated software which could be used as the basis for Battle Manager software and
processor design effort. Part of this effort would involve selection of an appropriate
multiprocessor  architecture,  as  well  as  development  of  the  appropriate  parallel  Battle  Planning 
software. 




TR-457 


ALPHATECH,  INC. 


A second direction for Phase II continuation would involve extension of the Phase I
work on the core parallel ILINE algorithms to produce advanced parallel WTA algorithms
capable of addressing important requirements such as adaptive preferential defense, anticipative
Battle Planning, nuclear interference avoidance and Battle Planning with discrimination
uncertainty. In [2], a theoretical structure was presented for incorporating the ILINE algorithm
into more general recursive WTA algorithms capable of addressing these important SDS
requirements. Furthermore, extensive testing with sequential versions of these algorithms
indicated  that  significant  SDS  effectiveness  improvements  would  result  from  the  use  of  these 
advanced  algorithms.  The  goal  of  this  Phase  II  effort  would  be  to  extend  the  Phase  I  efforts  in 
parallel  designs  for  the  core  ILINE  algorithm  in  order  to  produce  working  prototypes  of  these 
advanced  WTA  algorithms  which  can  be  executed  in  real  time  on  commercial  parallel 
computers.  Such  prototypes  can  be  incorporated  into  future  Command  Center  designs  for 
SDS. 

1.5  ORGANIZATION  OF  THIS  REPORT 

The  remainder  of  this  report  is  of  a  technical  nature,  and  serves  to  document  the 
advances  accomplished  under  phase  I  of  this  research.  In  Section  2,  we  describe  the  variation 
of  the  WTA  problem  which  is  the  focus  of  this  study,  and  discuss  the  ILINE  algorithm.  In 
Section  3,  we  describe  the  design  of  the  various  synchronous  parallel  AUCTION  algorithms 
which  were  implemented  on  different  multiprocessors;  we  also  describe  the  benchmarks 
obtained  on  the  different  multiprocessors.  In  Section  4,  we  overview  the  theory  and  design  of 
the  asynchronous  parallel  AUCTION  algorithms  implemented  on  the  Encore  Multimax,  and 
discuss  the  benchmark  results  obtained  from  our  implementations.  Appendix  A  contains  a 
discussion  of  the  theory  of  the  AUCTION  algorithm,  including  some  new  results  concerning 
the  validity  of  an  asynchronous  variation  of  the  algorithm.  These  results  are  part  of  a  paper 
[26] which will be submitted for publication.

SECTION 2

THE  ILINE  ALGORITHM  FOR  WTA 

In  this  section,  we  provide  a  mathematical  description  of  the  WTA  problem,  and 
discuss  the  ILINE  algorithm  for  obtaining  a  near-optimal  solution  of  this  problem. 

2.1  MATHEMATICAL  DESCRIPTION  OF  THE  WTA  PROBLEM 

Consider  the  following  target-oriented  weapon-target  assignment  problem.  The 
objective  is  to  minimize  the  weighted  expected  leakage  of  targets  through  the  defense 


$$
\text{(NP)} \qquad \min \; \sum_{i=1}^{T} V_i \prod_{j=1}^{W} (1 - p_{ij})^{x_{ij}} \tag{2-1}
$$


where T is the number of targets, W is the number of weapon farms/platforms, $x_{ij}$ is the
number of interceptors assigned from weapon farm/platform j to target i, $p_{ij}$ is the probability
of kill of an interceptor assigned from weapon farm/platform j to target i, and $V_i$ is the value
associated with failure to destroy target i. The constraints on problem (NP) are

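$$
\sum_{i=1}^{T} x_{ij} \le M_j \tag{2-2}
$$

for all weapon farms/platforms j, where $M_j$ is the interceptor inventory available at
farm/platform j, together with the constraint that interceptors are assigned in integer
quantities; that is,

$$
x_{ij} \in \{0, 1, \ldots, M_j\} \tag{2-3}
$$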

The problem (NP) subject to the constraints of Eqs. 2-2, 2-3 is a nonlinear integer
programming problem; a recent result by Lloyd and Witsenhausen [3] established that this
problem is NP-complete¹. A simpler version of this problem introduces the additional constraint

$$
\sum_{j=1}^{W} x_{ij} \le 1 \tag{2-4}
$$

With  this  additional  constraint,  problem  (NP)  becomes  equivalent  to  the  following  problem: 

$$
\text{(LP)} \qquad \max \; \sum_{i=1}^{T} \sum_{j=1}^{W} V_i \, p_{ij} \, x_{ij} \tag{2-5}
$$

subject  to  the  constraints  of  Eqs.  2-2,  2-3,  2-4.  Problem  (LP)  is  a  linear  integer  programming 
problem  of  network  type,  for  which  efficient  algorithms  exist. 
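
The equivalence can be checked directly. The following short derivation is a sketch, written with the target values $V_i$ carried through, and it assumes that the integrality constraint (2-3) and the added constraint (2-4) both hold, so that each target i receives at most one interceptor. In that case at most one $x_{ij}$ equals 1 for each i and the rest are 0, so the product in (2-1) collapses to a linear expression:

$$
\prod_{j=1}^{W} (1 - p_{ij})^{x_{ij}} = 1 - \sum_{j=1}^{W} p_{ij} \, x_{ij}
$$

Substituting into the objective of (NP) gives

$$
\sum_{i=1}^{T} V_i \prod_{j=1}^{W} (1 - p_{ij})^{x_{ij}}
= \sum_{i=1}^{T} V_i - \sum_{i=1}^{T} \sum_{j=1}^{W} V_i \, p_{ij} \, x_{ij}
$$

Since the first term on the right is a constant, minimizing the expected leakage is the same as maximizing the weighted sum of kill probabilities over the chosen assignments.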

Figure 2-1 illustrates the structure of the resulting linear integer programming problem.
This type of linear program is known as a transportation problem. In essence, a maximizing
set of flows $x_{ij}$ must be found between a set of source nodes (representing the targets in our
problem) and a set of sink nodes (representing the interceptor platforms and farms). The
overall flows must satisfy the conservation of flow constraints (cf. Eqs. 2-2 to 2-4), so that the
overall flow out of a target source cannot exceed 1, and the overall flow into weapon object j
cannot exceed its available interceptor inventory $M_j$. When all of the available weapon
inventories $M_j$ are equal to one, the resulting optimization problem is known as an assignment
problem; in this case, each interceptor is modeled as a separate weapon platform.

Figure 2-1. The structure of transportation problems divides the graph into two
sets of nodes (target nodes $T_i$ and farm/platform nodes $W_j$) with arcs
in between.

¹ This implies that the time required to find an optimal solution is likely to grow exponentially in the numbers
of weapons and targets in the problem.

The structure of the constraints of transportation problems² is such that the integrality
constraints of Eq. 2-3 can be relaxed to allow for fractional interceptor assignments x_ij. That
is, Eq. 2-3 can be replaced by the constraints

    0 \le x_{ij} \le M_j        (2-6)

With this relaxation, an optimal solution can be found for which all of the x_ij are integer. This
allows for the development of efficient algorithms by using the duality theory of linear
programming. One such efficient algorithm is the AUCTION algorithm developed by
Bertsekas [16] for assignment problems and extended by Bertsekas and Castanon [18] for
transportation problems. We overview the AUCTION and ILINE algorithms in the next
subsections.

2.2  DESCRIPTION  OF  THE  ILINE  ALGORITHM 

The basis of ILINE is to solve Problem (NP) by a successive linearization procedure,
whereby Problem (NP) is approximated at each stage by Problem (LP). The solution of
Problem (LP) is computed using AUCTION, and a fixed number of the assignments are
implemented. Based on these assignments, a new linearized version of Problem (NP) is
generated (a new Problem LP), and the procedure is repeated until all interceptors have been
assigned. Figure 2-2 illustrates the structure of the ILINE algorithm. The key
computation-intensive step is the solution of Problem (LP), which must be performed several
times in the procedure. The AUCTION algorithm provides a practical approach for repeated
solutions of Problem (LP), by reusing most of the previous solution as an initial point for
obtaining a new solution.

² This structure is known as unimodularity [15].




Figure  2-2.  Structure  of  the  ILINE  Algorithm 


At each iteration of the ILINE algorithm, a subset of interceptor assignments x*_ij have
already been fixed. Based on these fixed assignments, the ILINE algorithm computes an
expected probability of survival for each target i, as

    P_s(i) = \prod_{j=1}^{W} (1 - p_{ij})^{x^*_{ij}}        (2-7)


The linearization of Problem NP is based on using the expected probabilities of survival for
each target, resulting in the following optimization problem:

    (SLP)  \max \sum_{i=1}^{T} \sum_{j=1}^{W} P_s(i) \, p_{ij} \, x_{ij}        (2-8)

subject to

    \sum_{i=1}^{T} x_{ij} \le M_j - \sum_{i=1}^{T} x^*_{ij}        (2-9)

and the constraints of Eqs. 2-3 and 2-4.

Denote the optimal solution to Problem (SLP) by x°_ij. The ILINE algorithm ranks the
pairs ij with nonzero assignments (x°_ij > 0) in nonincreasing order according to the
marginal return p_ij P_s(i) V_i. The top k assignments according to this order are selected and
added to the corresponding permanent assignments x*_ij. If additional interceptors remain to be
assigned, then a subsequent iteration of the above procedure is conducted.
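
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in ILINE: the function names, the list-of-lists data layout, and the argument names (p, value, x_fixed, x_lp, k) are hypothetical, and the linearized solution x_lp is assumed to have been computed elsewhere (for example by AUCTION).

```python
from math import prod


def survival_probabilities(p, x_fixed):
    """Expected survival probability of each target i (Eq. 2-7).

    p[i][j] is the kill probability of weapon j against target i and
    x_fixed[i][j] is the number of interceptors already committed.
    """
    return [
        prod((1.0 - p_ij) ** x_ij for p_ij, x_ij in zip(p_row, x_row))
        for p_row, x_row in zip(p, x_fixed)
    ]


def select_top_k(p, value, x_fixed, x_lp, k):
    """Fix the k nonzero entries of x_lp with the largest marginal return.

    The marginal return of pair (i, j) is p[i][j] * P_s(i) * value[i].
    Returns a new matrix of permanent assignments; inputs are not modified.
    """
    ps = survival_probabilities(p, x_fixed)
    candidates = [
        (p[i][j] * ps[i] * value[i], i, j)
        for i in range(len(p))
        for j in range(len(p[i]))
        if x_lp[i][j] > 0
    ]
    # Nonincreasing order of marginal return (stable for ties).
    candidates.sort(key=lambda c: c[0], reverse=True)
    x_new = [row[:] for row in x_fixed]
    for _, i, j in candidates[:k]:
        x_new[i][j] += x_lp[i][j]
    return x_new
```

Calling select_top_k repeatedly, with a fresh linearized solution each time, gives the outer loop of the procedure; the stopping test (no interceptors left to assign) is left to the caller in this sketch.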

The  key  operation  of  the  ILINE  algorithm  is  the  optimal  solution  of  the  linearized 
Problems  (SLP).  The  algorithm  used  inside  of  ILINE  is  a  variation  of  the  AUCTION 
algorithm,  discussed  next. 

2.3 THE AUCTION ALGORITHM

The original AUCTION algorithm was described by Bertsekas [16] for assigning
individual bidders (corresponding to interceptors or targets) to individual objects
(corresponding to targets or interceptors). The theory of the AUCTION algorithm is discussed
in detail in Appendix A. In this subsection, we briefly overview the computations of the
AUCTION algorithm.

The classical assignment problem consists of finding a one-to-one match between a list
of n persons and n objects such that the sum of the benefits of the individual matches is
maximized. Denote the individual benefit of assigning person i to object j as a_ij. Then, the
classical assignment problem can be stated as follows:

    \max \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_{ij}        (2-10)

subject to

    \sum_{i=1}^{n} x_{ij} = 1,  j = 1, ..., n;        (2-11)

    \sum_{j=1}^{n} x_{ij} = 1,  i = 1, ..., n;        (2-12)

    x_{ij} \in \{0, 1\},  i = 1, ..., n;  j = 1, ..., n.        (2-13)

Note the similarity between the objective in the WTA problem of Eq. 2-8 and the objective in
the classical assignment problem: the benefit a_ij of assigning interceptor j to target i is given by
p_ij P_s(i). Note also that the constraints in Eqs. 2-11 and 2-12 require that an equal number
of interceptors and targets be present. This represents no loss in generality, since targets with
value 0 or interceptors with 0 probability of kill can be introduced to balance an uneven
assignment problem.

Ideally, the maximum benefit is obtained when each person i is assigned to an object j
offering maximal individual benefit a_ij. However, such an assignment is likely to violate the
constraints in Eq. 2-12 which require that each target be assigned an interceptor. In order to
resolve such conflicts, the AUCTION algorithm assigns a price p_j to each object j which
reflects the degree to which an object is in demand by different persons. The key observation
in the AUCTION algorithm is that there exists a set of prices p_j such that the optimal
assignment has the property that each person i is assigned to the object j(i) which offers the
highest net profit a_ij(i) - p_j(i) = max_j (a_ij - p_j). This is a consequence of the celebrated duality
theorem of linear programming [5]. The AUCTION algorithm consists of a search for the right
level of object prices p_j; this search takes the form of an auction, where unassigned persons
"bid" for objects and raise the prices of the objects accordingly.

The  AUCTION  algorithm  can  be  described  in  terms  of  a  sequence  of  iterations.  During 
each  iteration,  the  price  pj  of  some  object  j  is  raised;  in  addition,  tentative  assignments  of 
objects  to  persons  which  have  offered  the  highest  prices  for  those  objects  are  made.  Each 
iteration  can  be  described  in  terms  of  two  distinct  phases: 

a.  Bid Phase: In this phase, a subset I of persons which do not have a tentative
assignment to any object (unassigned persons) will offer bids for objects. Each
person i computes his bid as follows, based on the current object prices p_j.

1.  Person i must determine the object j(i) offering the maximum net profit based on
the current prices; that is,

    j(i) = \arg\max_{j} \{ a_{ij} - p_j \}

2.  Person i must determine the price level b(i) which it will bid for object j(i); this
price level is determined by computing the two highest net profit levels as follows:

    v(i) = \max_{j} \{ a_{ij} - p_j \}

    w(i) = \max_{j \ne j(i)} \{ a_{ij} - p_j \}

    b(i) = p_{j(i)} + v(i) - w(i) + \epsilon

where ε > 0 is a positive parameter, chosen small enough to guarantee convergence
to an optimal solution.

b.  Auction Phase: In this phase, each object j which received a bid in the Bid Phase
selects the highest bid and is tentatively assigned to the person i which offered the
highest bid. If the object was previously assigned to a different person i', this
assignment is deleted, so person i' will become unassigned for the next iteration.
This auction process is summarized below.

For each object j, define the set I(j) = { i ∈ I | j(i) = j } to be the set of bidders
currently bidding for object j. If I(j) = ∅ (the empty set), leave p_j unchanged and
x_ij unchanged, i = 1, ..., n. If I(j) ≠ ∅, update the price of object j as

    p_j = \max_{i \in I(j)} b(i)

If object j was previously assigned to person i' (i.e. x_i'j = 1), remove that
assignment (i.e. set x_i'j = 0). Assign object j to one of the persons offering the
highest bid for object j; that is,

    i^*(j) = \arg\max_{i \in I(j)} b(i)

Set x_{i^*(j) j} = 1.

The above bid and auction steps are repeated until each person is assigned to an object.
As discussed in Appendix A, proper choice of the constant ε is required for this procedure to
converge to an optimal assignment. In particular, if the benefits a_ij are all integer, the constant
ε must be chosen to be smaller than 1/n, where n is the total number of persons. For integer
benefits a_ij, by scaling all of the benefits by multiplication by (n+1), the AUCTION algorithm
can be conducted using only integer arithmetic. This was the approach used in our
implementations.
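
As an illustration of the bid and auction phases with integer scaling, the following Python sketch processes one unassigned person at a time. It is a minimal sketch under stated assumptions rather than the implementation described in this report: the function name, the dense list-of-lists benefit matrix, and the zero initial prices are choices made here for brevity, and every assignment is assumed to be admissible.

```python
from collections import deque


def auction_gauss_seidel(benefit):
    """Auction for an n x n assignment problem with integer benefits.

    Benefits are multiplied by (n + 1), so a bid increment of 1 in the
    scaled problem corresponds to epsilon = 1/(n + 1) < 1/n in the
    original one, and only integer arithmetic is needed.
    Returns (assigned, price); assigned[i] is the object of person i.
    """
    n = len(benefit)
    a = [[(n + 1) * b for b in row] for row in benefit]
    eps = 1
    price = [0] * n
    owner = [None] * n      # owner[j]: person tentatively holding object j
    assigned = [None] * n   # assigned[i]: object tentatively held by person i
    queue = deque(range(n))  # unassigned persons

    while queue:
        i = queue.popleft()

        # Bid phase: best object, best net profit v, second-best net profit w.
        best_j, v, w = None, None, None
        for j in range(n):
            profit = a[i][j] - price[j]
            if v is None or profit > v:
                best_j, v, w = j, profit, v
            elif w is None or profit > w:
                w = profit
        if w is None:        # only one object exists
            w = v
        bid = price[best_j] + v - w + eps

        # Auction phase: the single bidder wins and displaces the old owner.
        previous = owner[best_j]
        if previous is not None:
            assigned[previous] = None
            queue.append(previous)
        owner[best_j] = i
        assigned[i] = best_j
        price[best_j] = bid

    return assigned, price
```

For example, auction_gauss_seidel([[5, 4], [5, 1]]) first gives object 0 to person 0, then lets person 1 outbid person 0 for it, after which person 0 settles on object 1. The returned prices are in the scaled units.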

An  important  issue  which  affects  algorithm  performance  on  different  multiprocessor 
architectures  is  the  selection  of  data  structures  for  the  implementation  of  the  AUCTION 
algorithm.  Specifically,  there  are  many  WTA  problems  where  certain  interceptor-target 
assignments  are  known  to  be  infeasible  and  should  not  be  represented  as  part  of  the  problem. 

In the assignment problem, this is represented by a set A(i) of admissible objects for the
assignment of person i. Thus, the assignment x_ij = 0 unless j ∈ A(i). The sets A(i) can be
represented explicitly using sparse data structures, or they can be represented implicitly by
selecting the benefit a_ij = -∞ for j ∉ A(i) and using dense data structures. For sequential
computation, sparse data structures provide a considerable advantage over dense data
structures; for parallel computation, use of sparse data structures may require interprocessor
movement of data which can reduce efficiency.
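
The two representations can be contrasted with a small Python sketch. The layout chosen here (one dictionary per person mapping admissible object index to benefit) and the function names are assumptions made for illustration; the report's Fortran 77 data structures are not reproduced.

```python
NEG_INF = float("-inf")


def to_dense(sparse, n):
    """Expand sparse rows {object: benefit} into dense rows of length n.

    Inadmissible pairs receive benefit -infinity, so they can never be
    the best object in a bid.
    """
    dense = [[NEG_INF] * n for _ in sparse]
    for i, row in enumerate(sparse):
        for j, a_ij in row.items():
            dense[i][j] = a_ij
    return dense


def best_object_sparse(row, price):
    """Scan only the admissible objects A(i) stored in the sparse row."""
    return max(row, key=lambda j: row[j] - price[j])


def best_object_dense(row, price):
    """Scan all n objects; inadmissible ones have net profit -infinity."""
    return max(range(len(row)), key=lambda j: row[j] - price[j])
```

The sparse scan touches only the entries of A(i), which is the source of the sequential advantage noted above, while the dense scan has a fixed, regular access pattern.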

Note that any nonempty subset I of unassigned persons may submit a bid at each
iteration. This gives rise to a variety of possible implementations, named after their analogs in
relaxation and coordinate descent methods for solving systems of equations or unconstrained
optimization problems (see e.g. [27,28]):

a.  The Jacobi implementation, where I is the set of all unassigned persons at the
beginning of the iteration.

b.  The Gauss-Seidel implementation, where I consists of a single person, who is
unassigned at the beginning of the iteration.

c.  The block Gauss-Seidel implementation, where I is a subset of the set of all
unassigned persons at the beginning of the iteration. (The method for choosing the
persons in the subset I may vary from one iteration to the next, so this
implementation contains the preceding two as special cases.)

Generally,  in  a  serial  computation  environment,  the  Gauss-Seidel  implementation  tends 
to  be  the  fastest,  but  with  a  parallel  machine,  the  choice  is  unclear  because  all  the  bids  of  the 
persons  in  I  may  be  calculated  in  parallel.  It  is  important  to  consider  all  these  different  versions 
because  they  provide  starting  points  for  different  synchronous  and  asynchronous  parallel 
implementations. 

Figure  2-3  illustrates  the  Gauss-Seidel  variation  of  the  AUCTION  algorithm.  In  this 
variation, an unassigned bidder is selected from a queue; this bidder selects the most desirable
object  (based  on  the  object's  perceived  value  and  its  price)  and  selects  a  bid  price  for  this  object 
which  outbids  every  other  bidder  by  as  much  as  possible.  Thus,  if  in  a  previous  iteration 
another  bidder  had  successfully  bid  for  this  object,  this  bidder  is  now  rejected  and  joins  the 
bidders'  queue  for  future  iterations.  The  auction  proceeds  until  the  bidders'  queue  is  empty. 


[Figure 2-3 diagram; surviving label: Unassigned Person Queue]

Figure 2-3. Structure of the Gauss-Seidel variation of the AUCTION Algorithm


The Jacobi variation of the AUCTION algorithm is similar, but assumes that all of the
bidders on the bidding queue bid simultaneously; thus, an object may be bid on by more than
one bidder at a time. In contrast with the Gauss-Seidel algorithm, a bidder is no longer assured
of winning his bid, since other bidders may bid on the same object at the same time. Similarly,
after all of the bidders have completed their bid, the objects are awarded to the bidder with the
highest offered price, and a new round of bidding is initiated. Figure 2-4 illustrates the
structure of the Jacobi variation; in this figure, the number of parallel bidders p is equal to the
number of unassigned persons in the unassigned person queue. Figure 2-4 also represents the
block Gauss-Seidel variation when the number of parallel bidders is selected to be a number p
which is smaller than the number of unassigned persons in the queue.
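
A sequential Python sketch of this round structure follows. It is an illustration under assumptions, not the benchmarked code: the function name and the block_size parameter are hypothetical, benefits are assumed to be integers scaled by (n + 1) as described earlier, and the bids of a round are computed in a loop rather than on parallel processors. With block_size=None every unassigned person bids (Jacobi); with a smaller value only that many bid per round (block Gauss-Seidel).

```python
def auction_block(benefit, block_size=None):
    """Jacobi or block Gauss-Seidel auction for integer benefits.

    All bids of a round use the prices from the start of the round;
    objects are then awarded to their highest bidders.
    Returns assigned, where assigned[i] is the object of person i.
    """
    n = len(benefit)
    a = [[(n + 1) * b for b in row] for row in benefit]
    price = [0] * n
    owner = [None] * n
    assigned = [None] * n

    while True:
        unassigned = [i for i in range(n) if assigned[i] is None]
        if not unassigned:
            return assigned
        bidders = unassigned if block_size is None else unassigned[:block_size]

        # Bid phase: each bidder's computation is independent of the others.
        bids = {}  # object -> (highest bid so far, person who made it)
        for i in bidders:
            # Sorting is used for brevity; a single linear scan would suffice.
            profits = sorted(
                ((a[i][j] - price[j], j) for j in range(n)), reverse=True
            )
            v, best_j = profits[0]
            w = profits[1][0] if n > 1 else v
            bid = price[best_j] + v - w + 1  # increment of 1 in scaled units
            if best_j not in bids or bid > bids[best_j][0]:
                bids[best_j] = (bid, i)

        # Auction phase: award each object that received at least one bid.
        for j, (bid, i) in bids.items():
            if owner[j] is not None:
                assigned[owner[j]] = None
            owner[j] = i
            assigned[i] = j
            price[j] = bid
```

Losing bidders simply remain unassigned and bid again in a later round, which is why a bidder is not assured of winning in this variation.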

Sequential  implementations  of  the  Gauss-Seidel  and  Jacobi  variations  have  shown  that 
the Gauss-Seidel variation is 15-20% faster. In both variations, the key computation-intensive
step  is  the  computation  of  new  bids  for  each  bidder.  In  the  Gauss-Seidel  variation,  this  step 
encompasses  over  95%  of  the  total  computation  time  of  the  AUCTION  algorithm.  In  the 
Jacobi  variation,  the  awarding  of  new  auctions  is  harder,  so  the  computation  of  new  bids 
comprises  only  85%  of  the  total  computation  time. 


Figure  2-4.  Structure  of  the  Jacobi  variation  of  the  AUCTION  algorithm.  In  this 
figure,  p  is  the  number  of  unassigned  persons  present  in  the 
unassigned  persons  queue. 


2.4  VARIATIONS  OF  AUCTION  FOR  WTA  PROBLEMS 

The original AUCTION algorithm was designed to solve assignment problems, which
correspond to WTA problems when the interceptor platform inventories are all uniformly equal
to one. For WTA problems, weapon platforms often carry more than one interceptor, so the
platform inventories are larger than 1, and there are fewer weapons W than targets T. This
asymmetry can be exploited to yield more efficient algorithms; the theory of these algorithms is
described in Bertsekas and Castanon [18]. It also creates variations of the AUCTION
algorithm with different structure, depending on whether we assign the targets to be the
persons, or whether we assign the weapons to be the persons. The choice of variation for
parallel processing depends on the level of parallelism which one is interested in exploiting.

For medium-grained parallelism using shared-memory MIMD processors, the structure
of Fig. 2-4 appears more amenable for parallel processing than the structure of Fig. 2-3. In
this structure, bid tasks for separate persons can be executed in parallel; similarly, auction tasks
for separate objects can be executed in parallel. However, an important limit in the amount of
parallelism which can be obtained from this approach is the average length of the unassigned
persons queue. This limits the number of parallel bid tasks, which in turn limits the number of
parallel auction tasks. Figure 2-5 shows a typical histogram of the queue length for a
sequential implementation of the Jacobi AUCTION algorithm as a function of the number of
iterations (the test problem involved 100 persons). As Fig. 2-5 indicates, the average speedup
obtainable by this approach is limited to near 3-4, because of the dynamic load imbalance
across iterations.

Figure 2-5. Length of Unassigned Persons Queue versus iteration number for
Jacobi AUCTION.

For medium-grained parallelism using multiple bid tasks, the variation of the
AUCTION algorithm which should be most successful is one which maximizes the length of
the  unassigned  persons  queue  across  iterations.  This  is  accomplished  by  selecting  the  targets 
as  bidders,  since  there  are  more  targets  than  weapons,  leading  to  longer  average  queue  lengths. 
Either  the  block  Gauss-Seidel  or  the  Jacobi  variation  of  the  AUCTION  algorithms  would  then 
be  used,  depending  on  the  available  number  of  processors  and  the  overhead  required  for 
interprocessor  synchronization.  When  synchronization  overhead  is  high,  an  asynchronous 
implementation  of  the  AUCTION  algorithm  may  be  preferred;  the  theory  of  such  an 
asynchronous  implementation  is  described  in  Appendix  A. 

For fine-grained parallelism using SIMD architectures, the structure of Fig. 2-3 is
superior to the structure of Fig. 2-4. In the Gauss-Seidel variation, most of the time is spent in
the computation of individual bids. This operation is similar to finding a maximum value and
maximum element of a list of objects. The amount of parallel work increases with the length of
the object lists. Thus, the preferred variation of the AUCTION algorithm for exploiting
fine-grained parallelism is to use a Gauss-Seidel variation, with weapons as persons. In this
manner, the number of objects (corresponding to targets) is increased, thereby increasing the
size of the bid tasks for fine-grained parallelism.

In  the  subsequent  sections,  we  describe  the  design  of  the  various  parallel  AUCTION 
algorithm  variations  developed  under  this  contract. 




in  457 


ALPHATECH,  INC. 


SECTION  3 

SYNCHRONOUS  PARALLEL  AUCTION  ALGORITHMS 

3.1  INTRODUCTION 

In  this  Section,  we  overview  the  designs  of  the  various  synchronous  parallel 
AUCTION  algorithm  implementations,  and  discuss  the  benchmarking  results  obtained.  We 
first  discuss  the  parallel  AUCTION  algorithms  designed  for  MIMD  architectures  (the  Encore 
Multimax and the Alliant FX/8). Several parallel AUCTION algorithms were developed and
benchmarked;  these  algorithms  differ  in  the  degree  to  which  fine-grained  and  coarse-grained 
parallelism  is  used.  In  later  subsections,  we  discuss  the  parallel  AUCTION  algorithms 
designed  for  SIMD  architectures  (DAP  510  and  CM-2);  these  algorithms  were  based  on 
exploiting  fine-grained  parallelism,  and  are  similar  in  design  across  the  different 
multiprocessors.  The  benchmarking  results  illustrate  the  differences  in  performance  which  can 
be  achieved  on  different  multiprocessor  architectures. 

3.2 SYNCHRONOUS AUCTION ALGORITHMS ON THE ENCORE MULTIMAX

In  synchronous  shared  memory  implementations  of  the  AUCTION  algorithm,  all 
bidding  and  assignment  phases  are  separated  by  a  synchronization  point.  There  are  two  basic 
ways  to  parallelize  the  bidding  phase  for  the  set  of  unassigned  persons  I  and  a  combination  of 
the  two: 

a.  Parallelization across bids (or Jacobi parallelization): Here the calculations involved
in the bid of each person i ∈ I are carried out by a single processor. If the number
of persons in I, call it |I|, exceeds the number of processors p, some processors will
execute the calculations involved in more than one bid. (This will typically happen
in the early stages of a Jacobi-type algorithm where I is the set of all unassigned
persons.) If |I| < p, then p - |I| processors will be idle during the bidding phase,
thereby reducing efficiency. This will typically happen in the late stages of a
Jacobi-type algorithm.

b.  Parallelization within a bid (or Gauss-Seidel parallelization): Here the set I consists
of a single person as in the Gauss-Seidel implementation. The calculations
involved in the bid of each unassigned person i are shared by the p processors of
the system. Thus the set of admissible objects A(i) is divided in p groups of objects
A_1(i), A_2(i), ..., A_p(i). The best object, best value, and second best value are
calculated within each group in parallel by a separate processor. We call the
calculations within a group a search task. After all the search tasks are completed (a
synchronization of the processors is required to check this) the results are "merged"
by one of the processors, which finds the best value over all best group values, while
simultaneously computing the corresponding best object and size of bid. (It is
possible to do the merging in parallel using several processors, but this is inefficient
when the number of processors is small, as it was in our case, because of the extra
synchronization and other overhead involved.) The drawback of this method over
the preceding one is that it typically requires a larger number of iterations, since
each iteration involves a single person. Even though each Gauss-Seidel iteration
may take less time because it is executed by multiple processors in parallel, the
synchronization overhead is roughly proportional to the number of iterations.

c.  Hybrid approach (or block Gauss-Seidel parallelization): In this approach, the bid
calculations of each person are parallelized as in the preceding method, but the
number of processors used per bid is k, where 1 < k < p. We will assume that k
divides p evenly, so we can compute the bids of p/k persons in parallel, assuming
enough unassigned persons are available for the iteration (|I| > p/k). With proper
choice of k, this method combines the best features and alleviates the drawbacks of
the preceding two.
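
The search tasks and the merge step of parallelization within a bid can be sketched in Python as follows. The sketch runs the search tasks one after another; on a multiprocessor each call to search_task would run on its own processor. The function names, the use of contiguous groups, and the argument layout are assumptions made for illustration only.

```python
NEG_INF = float("-inf")


def search_task(i, group, benefit, price):
    """Scan one group of admissible objects for person i.

    Returns (best object, best net profit, second-best net profit).
    """
    best_j, v, w = None, NEG_INF, NEG_INF
    for j in group:
        profit = benefit[i][j] - price[j]
        if profit > v:
            best_j, v, w = j, profit, v
        elif profit > w:
            w = profit
    return best_j, v, w


def merge(results):
    """Combine per-group results into the overall best and second best."""
    best_j, v, w = None, NEG_INF, NEG_INF
    for group_j, group_v, group_w in results:
        if group_v > v:
            # The old overall best and this group's runner-up both
            # compete for the overall second-best value.
            best_j, v, w = group_j, group_v, max(v, w, group_w)
        else:
            w = max(w, group_v)
    return best_j, v, w


def compute_bid(i, admissible, benefit, price, p, eps):
    """Split A(i) into at most p contiguous groups, search, merge, and bid."""
    chunk = max(1, -(-len(admissible) // p))  # ceiling division
    groups = [admissible[s:s + chunk] for s in range(0, len(admissible), chunk)]
    results = [search_task(i, g, benefit, price) for g in groups]
    best_j, v, w = merge(results)
    return best_j, price[best_j] + v - w + eps
```

Because the groups are contiguous and merged in order, ties are broken the same way as in a single sequential scan, so the bid does not depend on the number of groups p.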

Once the bidding phase of an iteration is completed (a synchronization point), the
assignment phase is executed. This phase is typically carried out by a single processor in our
synchronous implementations. While it is possible to consider using multiple processors to
execute the assignment phase in parallel, the potential gain from parallelization is modest while
the associated overhead more than offsets this gain in our system.

In the subsequent subsections, we describe the designs and benchmark results obtained
from different parallel AUCTION algorithm designs for the Encore Multimax based on
Gauss-Seidel parallelism, Jacobi parallelism and block Gauss-Seidel parallelism. All
algorithms were coded in Fortran 77 using the same sparse data structures.

3.2.1 Gauss-Seidel AUCTION Algorithm

The  synchronous  Gauss-Seidel  AUCTION  algorithm  generates  a  single  bid  at  a  time, 
and uses multiple processors to search the possible objects in order to generate that bid. The
premise  of  the  parallel  Gauss-Seidel  AUCTION  algorithm  is  to  use  multiple  processors  to 
reduce  the  computation  time  associated  with  computing  each  bid.  The  flexibility  of  a  shared- 
memory  MIMD  architecture  allows  for  the  efficient  use  of  sparse  data  structures.  The  parallel 
algorithm  design  includes  synchronization  in  order  to  guarantee  that  the  bids  generated  by  the 
parallel  algorithm  are  independent  of  the  number  of  processors  used,  and  thus  represent  a 
faithful  replication  of  the  sequential  Gauss-Seidel  AUCTION  algorithm. 

Figure  3-1  illustrates  the  percentage  of  the  total  computation  time  which  is  spent  in 
searching  the  list  of  admissible  objects  A(i)  for  several  1000  person  assignment  problems  with 
varying  degrees  of  sparsity.  As  Fig.  3-1  illustrates,  the  sequential  Gauss-Seidel  AUCTION 
algorithm  spends  between  92-99%  of  its  computation  time  (depending  on  the  problem  size 
and  the  density  of  feasible  assignments)  searching  the  list  A(i).  This  percentage  increases  with 
the  average  number  of  elements  in  the  admissible  assignments  A(i),  so  that  greater  speedups 
are  possible  for  larger  problems. 

The design of the synchronous Gauss-Seidel AUCTION algorithm is illustrated in Fig.
3-2. The majority of the AUCTION algorithm is conducted on a single processor (called the
parent processor). Multiple processors are used to assist the parent processor in computing
each bid in parallel using a "divide and conquer" strategy; each processor is assigned to search
a fixed part of the list of objects A(i) which can be assigned to person i. Two synchronization
points are included in each bidding iteration. The first synchronization point is a barrier (based
on the barrier monitor developed at ANL/MCS [29]) which serves to delay the start of the
search of admissible objects until the previous price update is completed. The second
synchronization point is a monitor which is an extension of the Argonne monitors for portable
parallel programming [29]. This merge search monitor allows each processor, upon completion
of its search of A_k(i), to merge the results of its search (the highest and second highest net
profit levels in the sublist, and the object which provided the highest net profit level) with the
results of other processors which have completed their search, and then proceed to a barrier to
wait for all of the processors to complete their search. The monitor sequences the merging of
the processor searches to guarantee that the results of the merged search are identical with the
one-processor Gauss-Seidel algorithm.

[Figure 3-1 plot; horizontal axis: average percent of objects in A(i)]

Figure 3-1. Percentage of total Gauss-Seidel AUCTION computation time spent in
searching the lists of admissible objects for 1000 person assignment
problems, benefit range 1-1000, as a function of the density of
feasible assignments.

Figure 3-2. Design of the parallel synchronous Gauss-Seidel AUCTION
algorithm. Multiple processors are used to search the list of
admissible objects for a person; the results of the searches are merged
to compute a person's bid, and the rest of the bid and auction cycles
are conducted by a single processor.

In  order  to  understand  the  potential  performance  of  the  parallel  Gauss-Seidel 
AUCTION  algorithm,  we  have  constructed  an  empirical  model  for  the  computation  time  per 
iteration  with  p  processors  per  bid.  This  time  is  given  by 

T(p)=S(p)+M(p)+C(p)+V 

where 

S(p)=  Time  for  completing  the  search  tasks 
M(p)=  Time  for  merging  the  results  of  search  tasks 
C(p)=  Time  for  synchronization 
V=  Constant  overhead  per  iteration. 

Let us assume for convenience that each set of admissible objects A(i) has the same
number of elements, say n. By counting the number of operations and by assuming perfect
load balancing between the search tasks (i.e., an equal number of objects n/p in each of the
groups A_1(i), ..., A_p(i)), we have estimated roughly that the search time per iteration is

    S(p) = Constant \cdot ( n/p + \log(n/p) + \log(\log(n/p)) )        (3-1)

(The logarithmic terms account for the calculations involving the second best value.) The
merging time is proportional to p, while the synchronization time using software barriers is
roughly proportional to p.

It can be seen that, given n, there is an optimal value of p that minimizes the total time
per iteration. For example, if p is large, the increase of the synchronization and merging times
may offset the potential gains from parallelization of the search tasks. Because of various
constants involved in the preceding estimates of the search, merging, and synchronization
times, it is difficult to estimate a priori the optimal value of p required to solve the problem.
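
The trade-off can be explored numerically with a short Python sketch of the model T(p) = S(p) + M(p) + C(p) + V. The constants below (c_search, c_merge, c_sync, overhead) are hypothetical placeholders, not measured values, and natural logarithms are assumed since the report does not state the base; the sketch only shows how an interior minimum in p arises once constants are chosen.

```python
from math import log


def iteration_time(p, n, c_search=1.0, c_merge=1.0, c_sync=1.0, overhead=0.0):
    """Modelled time per iteration with p processors per bid.

    S(p) follows Eq. 3-1; merging and synchronization are both taken
    to be proportional to p. Requires n/p > 1 so the logarithms exist.
    """
    per_task = n / p
    if per_task <= 1.0:
        raise ValueError("model requires more than one object per search task")
    search = c_search * (per_task + log(per_task) + log(log(per_task)))
    return search + c_merge * p + c_sync * p + overhead


def best_processor_count(n, max_p, **constants):
    """Return the p in 1..max_p that minimizes the modelled iteration time."""
    return min(range(1, max_p + 1), key=lambda p: iteration_time(p, n, **constants))
```

With all constants equal to one and n = 200, the dominant terms are 200/p + 2p, so the minimum falls near p = 10; larger merge or synchronization constants push the optimum toward fewer processors.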

Figure 3-3 illustrates the performance of the synchronous Gauss-Seidel AUCTION
algorithm for a 1000 person, 20% dense assignment problem with benefits in the range
[1,1000]. All of the times reported in the figure are measured in terms of the parent processor
(the processor which executes the sequential part of the algorithm). The scan time is the time
which the parent processor (processor 1) spends in searching its part of the admissible object
lists A_1(i). The predicted relationship between scan time and the number of processors is
derived from Eq. 3-1. The synchronization time for a bid by person i is measured as the time
from which the parent processor finished scanning the subset of objects A_1(i) until the time the
parent processor is released from the merge search monitor to continue with the auction
process. As the results of Fig. 3-3 indicate, the achievable speedup for this problem is limited
to a factor of nearly 3 because of the increase in synchronization time required to merge the
results of the various searches. This factor will increase as the number of elements in the sets
A(i) increases.

Figure 3-4 illustrates the conjectured theoretical behavior of the total scan,
synchronization and computation times, based on fitting the models described in the previous
section with appropriate constants to match the problem size. Note the close correspondence
between the predictions of Fig. 3-4 and the empirical results of Fig. 3-3. The only minor
discrepancy is that the empirical synchronization time grows superlinearly with the number of
processors; this is probably due to increased contention for access to critical sections in the
barrier and merge search monitors. Similar phenomena were observed by Dritz and Boyle [30]
in their experiments using the Encore Multimax.

Figure 3-3. Performance of the synchronous Gauss-Seidel AUCTION algorithm
on  the  Encore  Multimax  as  the  number  of  processors  increases  for  a 
1000  person,  20%  dense  assignment  problem  with  benefit  range 
[1,1000].  Note  the  growth  in  merge  and  synchronization  time 
required  as  the  number  of  processors  increases.  This  limits  the 
maximum  speedup  to  a  factor  of  approximately  3. 


Figure 3-4. Theoretical behavior of synchronous Gauss-Seidel AUCTION
algorithm with increasing number of processors. The constants have
been matched to fit the times of a 1000 person, 20% dense assignment
problem with benefit range [1,1000].

Figure 3-5 illustrates the effective speedup achieved by the parallel Gauss-Seidel
AUCTION algorithm as a function of the density of feasible assignments for an 800 person
assignment problem. As the density decreases, the potential for parallel work also decreases.
However, for denser problems, speedups approaching factors of 6 are possible using up to 10
processors. Figure 3-6 illustrates similar results for larger, 1000 person assignment problems.
Note that the sequential computation time for this larger problem has nearly doubled. This
increase in computation time is due to an increase in the number of feasible assignments
which must be considered in the problem (which has also nearly doubled, from 640,000 to
1,000,000 for fully dense problems); the empirical computation time grows near-linearly with
the number of feasible assignments to be considered. For the larger 1000 person assignment
problem, a speedup of nearly 6.7 was achieved for the fully dense problem.


Figure 3-5. Comparison of best parallel and sequential times for Gauss-Seidel
AUCTION algorithm on Encore Multimax for 800 person assignment
problem, benefit range [1,1000] as a function of the density of
feasible assignments. The maximum number of processors used for
the parallel Gauss-Seidel AUCTION algorithm was 10 processors for
the fully-dense problem.

Figure 3-6. Comparison of best parallel and sequential times for Gauss-Seidel
AUCTION  algorithm  on  Encore  Multimax  for  1000  person 
assignment  problem,  benefit  range  [1,1000]  as  a  function  of  the 
density  of  feasible  assignments.  The  maximum  number  of  processors 
used  for  the  parallel  Gauss-Seidel  AUCTION  algorithm  was  10 
processors  for  the  fully-dense  problem. 


As Figs. 3-5 and 3-6 indicate, the potential speedup on MIMD architectures for the
Gauss-Seidel AUCTION algorithm depends critically on the density of the feasible
assignments (the speedup depends on the average number of feasible assignments for each
person, which is the product of the density and the total number of objects). For many WTA
problems, we expect the density of feasible assignments to be in the 10-70% range. This limits
the overall speedup for 1000 interceptor assignment problems to factors between 2.5 and 5.5.
These factors will increase as the numbers of interceptors and targets increase, since the overall
spatial volume of interest remains constant (thereby preserving the overall density of the
feasible assignments); in essence, the synchronization overhead in Fig. 3-3 will remain
constant (depending only on the number of processors used), while the parallelizable work for
the searches will increase in proportion to the number of objects. For problems with 10,000
objects, the overall speedup should mirror the speedups in the scan time, suggesting that
speedups of over 10 will be possible using 16 processors.

3.2.2  Jacobi  AUCTION  Algorithm 

The  second  synchronous  implementation  of  the  AUCTION  algorithm  was  the 
synchronous Jacobi AUCTION algorithm. In this algorithm, multiple processors are used to
generate  bids  simultaneously  for  different  persons.  The  number  of  simultaneous  bids 
generated  is  equal  to  the  minimum  of  the  number  of  processors  used  and  the  number  of 
unassigned  persons;  in  this  manner,  object  prices  are  updated  as  soon  as  possible,  leading  to  an 
expected  reduction  in  the  overall  number  of  bids  required  to  converge  to  an  optimal  solution. 
Each  processor  computes  the  bid  associated  with  a  different  person.  The  resulting  bids  are 
then  processed  sequentially  in  order  to  award  new  auctions  and  to  update  the  list  of  unassigned 
persons.  Sequential  processing  of  the  bids  guarantees  that  the  number  of  iterations  required  for 
convergence  of  the  Jacobi  AUCTION  algorithm  is  independent  of  the  order  in  which 
processors  finish  their  computations. 
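
The bid and award phases described above can be sketched as follows. This is a minimal, sequentially simulated illustration for a dense problem with fixed ε; the function names, the dense-matrix representation, and the rule that an outbid person simply rejoins the queue are assumptions of this sketch, not details of the Encore Multimax implementation.

```python
def compute_bid(i, benefits, prices, eps):
    """Return (object, bid) for person i against the given price snapshot."""
    margins = [a - p for a, p in zip(benefits[i], prices)]
    best = max(range(len(margins)), key=margins.__getitem__)
    v = margins[best]
    # Second-best margin; falls back to v when there is only one object.
    w = max((m for j, m in enumerate(margins) if j != best), default=v)
    return best, prices[best] + v - w + eps


def jacobi_auction(benefits, num_procs, eps):
    """Synchronous Jacobi auction on a dense n x n benefit matrix (sketch)."""
    n = len(benefits)
    prices = [0.0] * n
    owner = [None] * n            # owner[j] = person currently holding object j
    unassigned = list(range(n))
    while unassigned:
        # Bid phase: up to num_procs persons bid against the SAME prices.
        k = min(num_procs, len(unassigned))
        bidders, unassigned = unassigned[:k], unassigned[k:]
        bids = [(i,) + compute_bid(i, benefits, prices, eps) for i in bidders]
        # Award phase: bids are processed sequentially, in a fixed order.
        for i, j, b in bids:
            if b > prices[j]:
                if owner[j] is not None:
                    unassigned.append(owner[j])   # previous holder is displaced
                owner[j], prices[j] = i, b
            else:
                unassigned.append(i)              # outbid earlier in this round
    return owner
```

Because the award phase walks the bids in a fixed order, the outcome of each round in this sketch does not depend on which simulated processor would have finished first.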

The  design  of  the  synchronous  Jacobi  AUCTION  algorithm  is  illustrated  in  Fig.  3-7. 
Again,  there  are  two  synchronization  points  for  each  iteration  of  the  algorithm,  before  and  after 
the  compute  bids  operation.  However,  both  synchronization  points  are  implemented  with  the 
extensions  of  the  barrier  monitors  discussed  previously.  In  particular,  the  synchronization 
after the compute bids operation is only a barrier monitor because no merging of the individual
computations by each processor is required (unlike the synchronous Gauss-Seidel AUCTION
algorithm). This reduces the overall synchronization overhead by reducing the length of the
critical section in the synchronization monitor. After the bids have been computed, the parent
processor  conducts  the  auction  for  each  bid  and  places  unassigned  persons  back  into  the  queue. 

Figure 3-7. Design of synchronous Jacobi AUCTION algorithm. Multiple
processors are used to compute bids for multiple persons
simultaneously. The parent processor then processes the bids
sequentially.


An  important  aspect  of  the  synchronous  Jacobi  AUCTION  algorithm  is  that  the  amount 
of  potential  parallel  work  varies  across  iterations;  specifically,  it  depends  on  the  number  of 
remaining  unassigned  bidders.  When  the  number  of  unassigned  bidders  is  less  than  the 
number  of  available  processors,  some  of  the  processors  will  be  idle.  Figure  3-8  illustrates  the 
number  of  unassigned  bidders  per  bid  iteration  for  a  1000  person,  20%  dense  assignment 
problem, benefit range [1,1000] using 10 processors. In order to prevent idle processors from
competing for shared resources such as synchronization locks, the size of the synchronization
barriers was adaptively modified to match the number of non-idle processors. Idle processors
were diverted to a rest barrier, waiting to rejoin the computation when the number of
unassigned persons grew larger than the number of available processors (at the beginning of a
new ε-scaling phase; see Appendix A).

Figure 3-9 illustrates the performance of the synchronous Jacobi AUCTION algorithm.
Again, scan time is measured in terms of the time required for the parent processor to compute
a bid; scan time decreases with the number of processors because the parent processor
computes fewer bids (other processors compute bids simultaneously). Synchronization time is
measured in terms of the time spent by the parent processor at the two synchronization
barriers. Note that, unlike the synchronous Gauss-Seidel algorithm, scan time cannot be
reduced arbitrarily by increasing the number of processors. In the Jacobi AUCTION
algorithm, increasing the number of processors generally reduces the overall number of
iterations required to converge (by computing multiple bids in parallel); however, for iterations
where the number of unassigned persons is less than the number of processors, increasing the
number of processors has no effect on the number of parallel bids computed, thereby limiting
the reduction possible in scan time.

Figure 3-8. Jacobi AUCTION number of unassigned persons versus iteration
number for 1000 person, 20% dense assignment problem, benefit
range [1,1000] using 10 processors. Curves illustrate the number of
unassigned persons for different values of ε corresponding to different
ε-scaling cycles. Note the small fraction of iterations for which the
number of unassigned persons exceeds the number of available
processors (10).

Note the relatively low level of synchronization required for the Jacobi AUCTION
algorithm when compared to the Gauss-Seidel AUCTION algorithm. This is due to three
factors. First, the synchronization after computing bids is simpler because no merging of the
results of the processors is required. Second, the number of synchronization calls is reduced
because the total number of iterations is reduced by processing multiple bids in parallel.
Finally, the number of processors which contend for a synchronization lock is reduced
adaptively when the number of unassigned persons is less than the number of processors,
leading to simpler synchronization (with reduced contention) at each iteration.

Figure 3-9. Performance of the synchronous Jacobi AUCTION algorithm for
1000 person, 20% dense assignment problem, benefit range [1,1000]
as a function of number of processors.

The results in Fig. 3-9 indicate an interesting anomaly which is typical of the
AUCTION algorithm: increasing the number of processors sometimes produces an apparent
increase in computation time, as indicated in the difference between the 10 processor times and
the 8 processor times. The reason for this increase is that the number of iterations required for
convergence with 10 processors increased significantly over the number of iterations required
for convergence with 8 processors. This is because the computation of bids with 10
processors is based on a potentially different set of object prices than the computation of bids
with 8 processors. This fluctuation in the number of iterations required for convergence will
become a dominant factor in the performance of the asynchronous AUCTION algorithms
discussed in Section 4.

Figure  3-10  illustrates  the  speedups  achieved  by  the  Jacobi  AUCTION  algorithm  for 
several  800  person  and  1000  person  assignment  problems.  The  curves  indicate  that  the 
effective speedup from Jacobi parallelization in the 10-70% density range is not likely to
vary  much  with  either  the  number  of  persons  or  the  density  of  the  problem  (although  the 
potential  speedup  will  decrease  for  very  sparse  assignment  problems).  This  is  in  contrast  with 
Gauss-Seidel  parallelism,  where  the  speedups  possible  increased  with  problem  density  and 
with  assignment  problem  size. 


Figure 3-10. Speedup of parallel Jacobi AUCTION algorithm over the single-processor
algorithm as a function of the density of feasible assignments
for problems with 800 and 1000 persons, benefit range
[1,1000].

3.2.3 Hybrid AUCTION Algorithm

The results obtained with the previous two synchronous algorithms suggest that an
efficient parallel implementation should combine the speedups available from Gauss-Seidel
parallelization and Jacobi parallelization. In particular, by computing multiple bids
simultaneously, and by using multiple processors to compute each bid, a multiplicative effect
may  be  achievable  where  the  overall  speedup  is  the  product  of  the  Gauss-Seidel  speedup  and 
the  Jacobi  speedup.  The  synchronous  Hybrid  AUCTION  algorithm  is  an  attempt  to 
demonstrate  this  multiplicative  speedup;  in  this  algorithm,  persons  are  selected  two  at  a  time, 
and  two  bids  are  computed  in  parallel  (Jacobi  parallelization  with  two  processors).  For  each 
person  i,  the  admissible  objects  A(i)  are  searched  in  parallel  by  multiple  processors  (Gauss- 
Seidel  parallelization). 

The overall design of the synchronous Hybrid AUCTION algorithm is illustrated in Fig.
3-11. There are three synchronization points per iteration. An initial barrier is included to
delay the start of the object searches until all of the object prices are updated from the previous
iteration. A separate merge search monitor is included for each person, and a synchronization
barrier is used to wait until both bids are computed before proceeding to award the auctions.
The sizes of the barriers and monitors were tailored to the number of processors which
rendezvous at each synchronization point. Thus, the first barrier synchronized 2k processors,
the merge search monitors k processors and the last barrier only two processors, thereby
keeping the synchronization overhead to a minimum. The predicted speedup from the hybrid
approach should be 1.75 for the use of Jacobi parallelization with two bids computed
simultaneously, multiplied by the appropriate speedup (cf. Fig. 3-3) for using k processors
to compute each bid. Thus, when 12 total processors are used, the overall speedup should be
approximately 1.75 x 2.75 (from using 6 processors per bid) = 4.8125.
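
The arithmetic behind this prediction can be checked with a short sketch. The function names are hypothetical; the two input factors (1.75 for two simultaneous bids, 2.75 for six processors per bid) are the values quoted in the text, and the multiplicative form assumes the two sources of speedup do not interfere.

```python
def predicted_hybrid_speedup(jacobi_speedup, gauss_seidel_speedup):
    """Multiplicative prediction: the two speedups are assumed independent."""
    return jacobi_speedup * gauss_seidel_speedup


def parallel_efficiency(speedup, total_procs):
    """Fraction of ideal linear speedup represented by the given speedup."""
    return speedup / total_procs


# Two bids at a time, six processors per bid, twelve processors in total.
predicted = predicted_hybrid_speedup(1.75, 2.75)
print(predicted, parallel_efficiency(predicted, 12))
```

Even the predicted figure corresponds to a parallel efficiency of only about 40% on 12 processors, so the multiplicative estimate is already well short of linear speedup.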

Figure 3-11. Design of the synchronous Hybrid AUCTION algorithm.


Figure  3-12  illustrates  the  performance  of  the  synchronous  Hybrid  AUCTION 
algorithm  as  a  function  of  the  total  number  of  processors  used  for  the  same  1000  person,  20% 
dense  assignment  problem  described  previously.  The  single  processor  time  for  this  algorithm 
is  44  seconds.  The  synchronization  time  is  again  measured  in  terms  of  the  parent  processor, 
and  represents  the  total  time  that  the  parent  processor  spends  at  the  different  synchronization 
points.  As  the  curves  in  Fig.  3-12  indicate,  the  achieved  speedup  is  much  lower  than  the 
anticipated  multiplicative  speedup  from  combining  the  Jacobi  and  Gauss-Seidel  speedups.  For 
example,  the  actual  speedup  using  12  processors  is  under  4,  whereas  the  predicted  speedup  is 
over  4.8.  The  explanation  for  this  loss  of  effectiveness  can  be  seen  in  the  growth  of  the 
synchronization time with the total number of processors used, even though the total number of
iterations has been reduced by a factor of 1.83. This synchronization time represents the
dominant part of the overall computation time for large numbers of processors, and prevents
effective combination of the speedups possible from Gauss-Seidel and Jacobi parallelization.
This motivated the development of asynchronous Hybrid AUCTION algorithms with reduced
synchronization  overhead,  using  the  theory  developed  in  Appendix  A.  These  algorithms  will 
be  discussed  further  in  Section  4. 

Figure 3-12. Performance of the synchronous Hybrid AUCTION algorithm on

Encore  Multimax  as  a  function  of  the  number  of  processors  for  1000 
person,  20%  dense  assignment  problem,  benefit  range  [1,1000]. 


Figure  3-13  illustrates  the  performance  of  the  parallel  Hybrid  AUCTION  algorithm  and 
the  parallel  Gauss-Seidel  AUCTION  algorithm  as  a  function  of  the  number  of  processors.  As 
expected,  the  Hybrid  AUCTION  algorithm  can  use  a  larger  number  of  processors  in  a  more 
effective manner, since the merge and synchronization time is significantly reduced by having a
smaller number of overall iterations (from computing bids two at a time) and by merging the
results of only half the number of processors. However, Fig. 3-13 also illustrates the absence
of a multiplicative speedup; the ratio of the best Gauss-Seidel AUCTION time to the best
Hybrid AUCTION time is about 1.35, which is smaller than the 1.75 factor.

Figure 3-13. Comparison of Hybrid AUCTION and Gauss-Seidel AUCTION
algorithms for similar numbers of processors for 1000 person, 20%
dense assignment problem, benefit range [1,1000].


3.3  SYNCHRONOUS  AUCTION  ALGORITHMS  ON  THE  ALLIANT  FX/8 

The  parallel  algorithms  discussed  in  Section  3.2  were  implemented  on  the  Encore 
Multimax  with  no  assistance  from  any  automated  parallelization  tools.  Parallel  processing  was 
implemented  by  generating  parallel  tasks,  and  having  the  operating  system  of  the  Encore 
Multimax schedule these tasks concurrently on multiple processors. Synchronization of these
tasks  was  achieved  by  writing  explicit  software  monitors,  using  a  spinlock  mechanism  as  the 
basic  synchronization  primitive  provided  by  the  Multimax. 

A different approach for parallel algorithm development is to use a parallelizing
compiler, which searches for work to do in parallel, and automatically distributes parallel work
across processors. The Alliant FX/8 computer has a Fortran compiler with this capability for
automatic parallelization; furthermore, the Alliant FX/8 had other interesting architectural
features which made it an interesting candidate for investigation. These features are:

1. The automatic parallelizing Fortran compiler;

2. The Alliant architecture is designed to implement several synchronization primitives
in hardware, thereby reducing the overhead required for interprocessor
synchronization;

3. Each of the Alliant FX/8's processors is a vector processor, which is a particular
type of SIMD architecture. Thus, the Alliant FX/8 is a hybrid architecture, capable
of multiprocessor MIMD and SIMD processing;

4. The Alliant FX/8 has a high-level array language (Fortran 8X) which is similar to
the array languages used on SIMD architectures such as the DAP 510 or the CM-2.

Thus, conducting experiments on the Alliant FX/8 provided a natural transition from MIMD
architectures to SIMD architectures, and allowed us to evaluate the potential effectiveness of
vector-processing and automatic parallel compilation for implementation of parallel AUCTION
algorithms.

On  the  Alliant  FX/8,  we  experimented  only  with  the  Gauss-Seidel  AUCTION 
algorithm. Three different versions of the algorithm were developed:

1. Sequential Gauss-Seidel AUCTION, a Fortran 77 version using sparse data
structures which corresponded to the most effective sequential implementation;

2. Parallel Gauss-Seidel AUCTION, a Fortran 77 version using sparse data
structures, which was rewritten to avoid data dependencies which restricted the
parallelization capability of the automated compiler;

3. Gauss-Seidel AUCTION 8X, a Fortran 8X version using dense data structures
which was written to represent the AUCTION algorithm using array operations.

The  sequential  Gauss-Seidel  AUCTION  algorithm  was  identical  to  the  sequential 
version  used  in  the  Encore  Multimax,  and  required  no  further  development.  There  is  a  key 
aspect  to  the  sequential  algorithm  which  must  be  understood  in  order  to  identify  the 
transformations  required  for  developing  the  parallel  Gauss-Seidel  AUCTION  algorithm.  As 
Fig.  3-1  illustrates,  the  key  operation  which  consumes  most  of  the  computation  time  is  the 
computation  of  a  bid.  Referring  to  the  description  of  this  operation  in  Section  2.3,  the 
computations required for a bid from person i are:


$$j(i) = \arg\max_j \{a_{ij} - p_j\} \tag{3-2}$$

$$v(i) = \max_j \{a_{ij} - p_j\} \tag{3-3}$$

$$w(i) = \max_{j \neq j(i)} \{a_{ij} - p_j\} \tag{3-4}$$

$$b(i) = p_{j(i)} + v(i) - w(i) + \epsilon \tag{3-5}$$

The  difficult  computations  are  in  Eqs.  3-2,  3-3,  and  3-4.  Each  of  these  computations  requires 
searching  the  list  of  admissible  objects  for  person  i,  and  is  a  reduction  operation  which  maps  a 
long  vector  of  numbers  into  a  single  scalar.  In  the  sequential  implementation  of  the  Gauss- 
Seidel  AUCTION  algorithm,  all  three  quantities  (j(i),  v(i),  w(i))  are  computed  in  a  single 
search  of  the  object  list.  However,  this  computation  introduces  data  dependencies  which 
prevent  the  automatic  parallelization  of  these  operations  on  the  Alliant  FX/8. 
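
The single-search computation can be sketched as follows. This is a minimal Python illustration, not the report's Fortran 77 code; the function name is hypothetical, and the sketch assumes a dense row (every object admissible) with at least two objects. The running values of v and w are carried from one loop iteration to the next, which is the kind of loop-carried dependence that hinders automatic parallelization.

```python
def bid_single_pass(row, prices, eps):
    """One scan of the object list yielding j(i), v(i), w(i) and b(i)."""
    best_j, v, w = -1, float("-inf"), float("-inf")
    for j, (a, p) in enumerate(zip(row, prices)):
        margin = a - p
        if margin > v:
            # New best object; the old best margin becomes the second best.
            best_j, v, w = j, margin, v
        elif margin > w:
            w = margin
    return best_j, v, w, prices[best_j] + v - w + eps
```

Ties are resolved in favor of the first object encountered, so j(i) is the lowest index attaining the maximum margin.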

In order to achieve maximum speedup and concurrency on the Alliant FX/8, the
quantities j(i), v(i) and w(i) must be computed using three separate searches of the object list (a
fourth array operation is also required, so the total computation is nearly four times longer).
Thus, the parallel Gauss-Seidel AUCTION algorithm on the Alliant FX/8 is significantly
slower when executed on a single sequential processor than the sequential Gauss-Seidel
AUCTION algorithm. Similarly, the Gauss-Seidel AUCTION 8X algorithm requires three
different array search operations to compute a bid for person i. We defer discussion of the
implementation of the Gauss-Seidel AUCTION 8X algorithm until the next subsection, where
we discuss array language implementations for the SIMD architectures.
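
A bid computed with separate reductions might look like the sketch below. It is a Python illustration with a hypothetical function name, assuming a dense row with at least two objects; it is not the Alliant Fortran code. Each statement is either an elementwise array operation or an independent reduction over the whole list, which is why the list is traversed roughly four times instead of once.

```python
def bid_by_reductions(row, prices, eps):
    """Compute j(i), v(i), w(i), b(i) with one array op and three reductions."""
    # Array operation: net profit of every object.
    margins = [a - p for a, p in zip(row, prices)]
    # Reduction 1: best net profit v(i).
    v = max(margins)
    # Reduction 2: lowest index attaining v(i), giving j(i).
    j = min(k for k, m in enumerate(margins) if m == v)
    # Reduction 3: best net profit over all objects other than j(i), giving w(i).
    w = max(m for k, m in enumerate(margins) if k != j)
    return j, v, w, prices[j] + v - w + eps
```

None of the reductions depends on a value updated inside another reduction's loop, so each can be split across processors or vector lanes independently.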

Figure 3-14 illustrates the performance of the three algorithms for 800-person
assignment problems with variable feasible assignment density. Note the logarithmic scale of
the vertical axis. Three different compiled versions of the parallel Gauss-Seidel AUCTION
algorithm were used: the version compiled to execute on one sequential processor (AUCTION
1S), the version compiled to execute on one vector processor (AUCTION 1V), and the version
compiled to execute on all 8 vector processors (AUCTION VC). The other curves correspond
to the sequential Gauss-Seidel AUCTION algorithm (SAUCTION) and the Gauss-Seidel
AUCTION 8X algorithm (AUCTION 8X).

Note the similar behavior in Fig. 3-14 of the AUCTION 1S, AUCTION 1V and
SAUCTION algorithms as a function of feasible assignment density. In essence, the ratio of
computation times between these algorithms is a constant factor, which reflects the additional
number of searches of the object list required by the parallel Gauss-Seidel AUCTION
algorithm. As predicted, the AUCTION 1S computation times are nearly four times longer
than the SAUCTION computation times. Surprisingly, the use of vectorization is insufficient
to fully compensate for this difference, so the AUCTION 1V times are about 10% slower than
the SAUCTION times. When both vectorization and concurrency are used, the AUCTION VC
times are faster than the SAUCTION times, but the speedup depends explicitly on the density
of the feasible assignments. The maximum speedup (achieved for the fully dense problem)
was nearly a factor of 4 (significantly smaller than the speedup on the Encore Multimax using
only scalar processors). On the other hand, when referenced with respect to the AUCTION 1S
times, the AUCTION VC times achieve a speedup of over 15 for dense
assignment problems! This emphasizes the importance of using an efficient sequential
implementation of the AUCTION algorithm as a scalar benchmark.

Figure 3-14. Performance of the different Gauss-Seidel AUCTION algorithms on
the Alliant FX/8 for 800 person assignment problems with benefit
range [1,1000], as a function of the density of feasible assignments.

Figure  3-14  also  illustrates  the  relative  advantages  of  using  sparse  data  structures 
versus dense data structures. Note the relatively flat AUCTION 8X computation times as a
function of feasible assignment density, when compared with the curves of the other
algorithms implemented using sparse data structures. Note that, for fully dense problems, a
small efficiency gain is achieved by using dense data structures (roughly 15% of the overall
computation time). However, once the problems become moderately sparse (below 80%
dense),  the  sparse  AUCTION  VC  implementation  is  significantly  faster  than  the  dense 
AUCTION  8X  implementation. 

Figure 3-15 illustrates the performance of the same 5 algorithms on a set of 1000
person assignment problems. Again, the AUCTION 1S times are nearly four times larger than
the SAUCTION times, and the use of vectorization in AUCTION 1V is insufficient to
compensate for the loss of efficiency caused by scanning the admissible object list an
increased number of times. The larger problem size results in an increased maximum speedup
of the AUCTION VC time (nearly 4.5 for fully dense problems) when compared with the
SAUCTION times. Note the interesting anomaly present for the 5% dense problem. The
vector-concurrent version of the parallel Gauss-Seidel AUCTION algorithm is slower than
both the vector version of the same algorithm running on a single processor and the sequential
Gauss-Seidel AUCTION algorithm. This reflects the compiler's inability to select dynamically
how many parallel processors should be used in the computation. For this problem, the object
lists averaged 50 objects; vectorization of the searches using length 32 vectors is more efficient
than use of multiple processors, given the small number of objects to be searched.

Figure 3-15. Performance of the different Gauss-Seidel AUCTION algorithms on
the Alliant FX/8 for 1000 person assignment problems with benefit
range [1,1000], as a function of the density of feasible assignments.

3.4  SIMD  AUCTION  ALGORITHMS 

As discussed previously, the majority of the computation time of the sequential Gauss-Seidel
AUCTION algorithm is spent in the scan operation, which consists of searching each
object list A(i) in order to find the object offering the maximal net profit, and the two highest
profit levels. The goal of our single instruction stream, multiple data stream (SIMD)
implementations is to reduce the overall time associated with these searches. An important aspect of
doing this is to minimize movement of data between processors. Thus, the SIMD parallel
algorithms were designed without the use of sparse data structures.

Figure 3-16 illustrates the basic concept of the SIMD Gauss-Seidel AUCTION
algorithm design. The SIMD architecture is viewed as a long vector of processors. Figure
3-16 shows a number of processors which equals the number of persons in the assignment
problem; this was the case for the benchmark problems and the algorithms implemented in the
DAP 510 and the Connection Machine CM-2. Viewing the benefits aij as a matrix, each
processor contains the jth column of the matrix (that is, {aij, i = 1, ..., n}) and the price pj of that
column. In this manner, each processor can form independently the net profit aij - pj. The
maximum value of the net profit and the location of a maximal argument are obtained by
reductions of the array of net profits into scalar values using the interprocessor communication
network. Since the Gauss-Seidel AUCTION algorithm operates on only a single person i at a
time, the relevant data is spread maximally across processors, thereby maximizing the potential
speedup.
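
The mapping and reduction can be simulated sequentially as in the sketch below. It assumes, for illustration only, that the interprocessor reduction proceeds as a pairwise tree; the report does not specify how the DAP 510 or CM-2 networks perform the reduction, and the function name is hypothetical.

```python
def simd_best_object(benefit_row, prices):
    """Simulate one bid search in which processor j holds a[i][j] and p[j]."""
    # Local step: every processor forms its own net profit independently.
    candidates = [(a - p, j) for j, (a, p) in enumerate(zip(benefit_row, prices))]
    # Reduction step: combine neighbors pairwise until one candidate remains.
    rounds = 0
    while len(candidates) > 1:
        merged = []
        for k in range(0, len(candidates) - 1, 2):
            left, right = candidates[k], candidates[k + 1]
            # Keep the larger profit; on ties keep the lower object index.
            merged.append(left if left[0] >= right[0] else right)
        if len(candidates) % 2:
            merged.append(candidates[-1])   # odd element passes through
        candidates = merged
        rounds += 1
    value, index = candidates[0]
    return index, value, rounds
```

In this simulation the number of combining rounds grows logarithmically with the number of processors, while the local subtraction takes a single step regardless of problem size.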


Figure  3-16.  Illustration  of  the  data  mapping  into  processors  for  parallel  SIMD 
Gauss-Seidel  AUCTION  algorithm.  Each  processor  receives  a 
column  of  the  benefit  matrix  (corresponding  to  a  single  object),  as 
well  as  the  price  of  the  object  corresponding  to  that  column. 

With this data-mapping concept, we designed the parallel implementations of the SIMD
Gauss-Seidel AUCTION algorithm on the various architectures by using the appropriate array
language extensions to implement the array arithmetic and reduction operations. On the DAP
510, the array language used was Fortran Plus; on the Connection Machine CM-2, we used the
C* language (its Fortran 8X compiler was still under development). In order to illustrate the
array operations required, we provide Fortran 8X versions of the key computations required
for a bid by person i, and discuss the similar operations required for implementation in C* and
Fortran Plus.
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In  the  bid  phase  of  the  Gauss-Seidel  AUCTION  algorithm,  the  critical  computations  are 
associated  widi  computing  v(i),  j(i)  and  w(i)  as  in  Eqs.  3-2,  3-3  and  3-4.  The  calculation  of 
v(i)  in  Fortran  8X  is  programmed  as: 

MARGINS  =  A(i,:)  -  P 
v(i)  =  maxval(MARGINS) 

As  may  be  seen,  FORTRAN  8X  permits  direct  calculation  to  be  made  on  vectors  and  arrays. 
Thus,  P  and  MARGINS  are  length  n  vectors,  A  is  an  n  x  n  matrix,  and  v(i)  is  a  scalar.  The 
construct  A(i,:)  refers  to  the  ith  row  of  the  matrix  A.  The  function  maxval  is  a  reduction 
operator  which  returns  the  value  of  the  largest  element  contained  in  its  vector  or  array 
argument.  Both  the  C*  and  Fortran  Plus  languages  contain  reduction  operators  (>?=  in  C*, 
maxv  in  Fortran  Plus)  which  are  equivalent  to  the  above  maxval  operator. 

Computation  of  j(i)  can  now  be  evaluated  in  Fortran  8X  as: 

MBIDS = MARGINS .eq. v(i)
TEMP = ∞
where(MBIDS) TEMP = INDICES
j(i) = minval(TEMP)

In  this  excerpt,  MBIDS  is  a  logical  vector  marking  all  occurrences  of  v(i)  in  MARGINS, 
INDICES  is  a  vector  storing  the  indices  {1,2, ...,  n}  and  TEMP  is  set  to  the  integer  indices  of 
these  occurrences  by  the  where  statement.  Thus,  j(i)  is  the  index  of  the  first  occurrence  of  v(i) 
in MARGINS. Again, both C* and Fortran Plus contain masked assignment operators
corresponding  to  the  Fortran  8X  construct. 

The remaining parameter required for the computation of a bid is w(i). The Fortran 8X
code for the computation of this parameter is given by:

w(i) = maxval(MARGINS, mask=INDICES.ne.j(i))

The additional feature of the Fortran 8X code to be noted here is that the maximum is taken over
a specified subset of elements of MARGINS. This is accomplished by the keyword argument
mask. Similar masked reduction operators exist in C* and Fortran Plus.
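
The three Fortran 8X excerpts above can be gathered into a single routine. The following is a minimal sketch, written in Python rather than Fortran 8X so that it can be run directly; the function name compute_bid and its arguments are illustrative assumptions and do not come from the implementations described here. Indices are 0-based, whereas the Fortran excerpts use the 1-based vector INDICES.

```python
def compute_bid(benefit_row, prices):
    """Return (v, j, w) for one person.

    benefit_row plays the role of A(i,:), prices the role of P.
    v is the largest net profit, j the index of its first occurrence,
    and w the largest net profit over all objects other than j.
    """
    # MARGINS = A(i,:) - P
    margins = [a - p for a, p in zip(benefit_row, prices)]
    # v(i) = maxval(MARGINS)
    v = max(margins)
    # j(i) = smallest index at which MARGINS equals v(i)
    j = min(k for k, m in enumerate(margins) if m == v)
    # w(i) = maxval(MARGINS, mask = INDICES .ne. j(i))
    others = [m for k, m in enumerate(margins) if k != j]
    w = max(others) if others else float("-inf")
    return v, j, w
```

For example, compute_bid([10, 7, 9], [3, 1, 2]) forms the margins [7, 6, 7] and returns (7, 0, 7): the tie for the best margin is resolved in favor of the first occurrence, as in the minval(TEMP) step.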

Figure 3-17 illustrates the performance of the SIMD Gauss-Seidel AUCTION
algorithms on the DAP 510 and the Connection Machine CM-2 for 800 person assignment
problems as a function of feasible assignment density. For comparison, we have included the
times of the sequential Gauss-Seidel AUCTION algorithm on the Encore Multimax and of a Fortran
8X implementation of the Gauss-Seidel AUCTION algorithm on the Alliant FX/8 using 8 vector
processors. Note that the use of dense data structures in the SIMD algorithms results in a
computation  time  which  is  effectively  constant  with  feasible  assignment  density.  In  essence, 
only  a  fraction  of  the  processors  (equal  to  the  feasible  assignment  density)  are  doing  useful 
work  in  the  sparse  assignment  problems.  In  contrast,  the  sequential  Gauss-Seidel  AUCTION 
algorithm  uses  sparse  data  structures,  and  its  overall  computation  time  is  reduced  significantly 
for  sparse  assignment  problems.  Figure  3-18  illustrates  similar  results  for  1000  person 
assignment  problems. 


Figure 3-17. Computation times of SIMD Gauss-Seidel AUCTION algorithms for
800 person assignment problems (benefit range [1,1000]) as a
function of feasible assignment density. (Horizontal axis: density of
feasible assignments.)

Jacobi parallelization) in the C-PARIS language for assignment problems with comparable
numbers of persons and similar benefit range.

The  second  observation  is  aimed  at  explaining  the  exceptionally  fast  performance  of  the 
DAP  510  on  this  class  of  algorithms.  In  essence,  the  DAP  510  communications  architecture 
allows  it  to  execute  reduction  operations  such  as  minval  in  a  time  which  is  independent  of  the 
number  of  processors  used  for  the  reduction  operation.  Furthermore,  these  reductions  use  a 
specific  bit-level  algorithm  across  all  processors  which  allows  the  computation  of  the  minimum 
of  1024  32  bit  numbers  in  about  12  microseconds. 

To illustrate this algorithm, consider taking the minimum of the following list of 4
numbers: (6, 2, 10, 2). The binary representation of these numbers is

Decimal  Binary
   6      0110
   2      0010
  10      1010
   2      0010

The  minval  routine  on  the  DAP  510  employs  only  logical  bit  operations  and  succeeds  in 
locating  all  occurrences  of  the  smallest  value.  Specifically,  this  routine  computes  a  vector  of 
bits  (of  length  equal  to  the  number  of  elements)  whose  0-bits  locate  the  minima  present  in  the 
list.  Note  that,  in  the  list  of  binary  numbers  above,  the  column  of  most-significant  bits  0010 
contains  the  information  that  the  third  number  in  the  list  cannot  be  the  smallest.  Therefore, 
0010 locates the minimum as being among the first, second and fourth numbers in the list. In a
second application of the same reasoning, note that a Boolean OR combination of the first
column with the second column (0010 OR 1000 = 1010) further narrows the choices for
minimum  to  the  second  and  fourth  numbers. 

The complete algorithm for minval on the DAP 510 essentially amounts to ORing
together all of the bit columns of the binary representation of a list of numbers,
starting at the most significant end. Some care must be taken in order to avoid obtaining a
vector of all 1's. Whenever the running result comes up all ones, its previous value must be
used to continue the algorithm. That would be necessary, for example, in the next step of the
sample calculation above, where the third bit column is 1111. Detecting such a condition can
be done efficiently on the DAP 510 because of its ability to test bits across all processors.
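
The column-by-column OR procedure described above can be sketched in a few lines. This is a sequential Python illustration of the logic only, not the DAP 510 routine itself; the function name bit_serial_min_mask and the explicit width argument are assumptions made for the example, and the values are assumed to be non-negative integers that fit in width bits.

```python
def bit_serial_min_mask(values, width):
    """Return one bit per value; the 0-bits locate the minima of the list."""
    running = [0] * len(values)              # 0 = still a candidate minimum
    for bit in range(width - 1, -1, -1):     # most significant column first
        column = [(x >> bit) & 1 for x in values]
        trial = [r | c for r, c in zip(running, column)]
        # If the OR comes up all ones, every remaining candidate has a 1 in
        # this column, so the previous running value is kept unchanged.
        if not all(trial):
            running = trial
    return running


mask = bit_serial_min_mask([6, 2, 10, 2], width=4)
minimum = [6, 2, 10, 2][mask.index(0)]
```

For the list (6, 2, 10, 2) the running vector goes 0010, 1010, 1010 (the all-ones result of the third column is discarded), 1010, so the second and fourth entries are marked as minima and the minimum value is 2.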

A  similar  approach  is  used  for  implementing  the  maxval  reduction  operation  which  is 
used  in  the  AUCTION  algorithm.  Thus,  as  long  as  the  number  of  processors  is  larger  than  the 
number  of  objects,  the  DAP  architecture  provides  a  near-optimal  match  to  the  computation 
requirements  of  the  Gauss-Seidel  AUCTION  algorithm  for  dense  assignment  problems. 

There  are  several  unresolved  issues  associated  with  the  use  of  SIMD  architectures  for 
implementation  of  the  AUCTION  algorithm.  The  first  issue  involves  the  potential  use  of 
sparse  data  structures.  For  large  sparse  assignment  problems,  a  lot  of  the  available  memory  is 
wasted  in  each  processor  when  using  dense  data  structures.  However,  using  sparse  data 
structures  will  require  data  movements,  which  will  reduce  the  efficiency  of  the  SIMD 
architectures.  For  applications  using  sparse  data  structures,  the  more  flexible  communication 
network structure of the CM-2 (versus the grid structure of the DAP 510) may offer some
advantages. 

The second issue concerns whether the speedups available from
Gauss-Seidel and Jacobi parallelism can be combined on a SIMD architecture. This requires the
ability to compute bids for multiple persons simultaneously. Although such an arrangement is
possible on the Connection Machine CM-2 by careful arrangement of the data across different
processors (see [25] for a discussion), the persons which will be unassigned at any one
iteration are not known a priori, so that, in practice, data movements among processors may be
required  to  achieve  this  combination.  Again,  this  would  lead  to  a  decrease  in  the  overall 
efficiency  of  the  parallel  SIMD  AUCTION  algorithm. 

In spite of these unresolved issues, SIMD architectures offer the promise of significant
computation reduction for large assignment problems. Figures 3-17 and 3-18 illustrate that the
Gauss-Seidel AUCTION algorithm was nearly two orders of magnitude faster on the DAP 510
than on a sequential processor. Figures 3-19 and 3-20 illustrate the relative performance of the
DAP 510 algorithm when compared with the fastest MIMD algorithms on the Alliant FX/8 and
the Encore Multimax. Even for very sparse problems (density 5%), the computation time on
the  DAP  510  using  dense  data  structures  was  comparable  to  the  computation  times  achieved  by 
the  fastest  parallel  MIMD  algorithms. 


Figure 3-19. Performance of best MIMD and SIMD Gauss-Seidel AUCTION
algorithms for 800 person assignment problems, benefit range
[1,1000]. (Horizontal axis: density of feasible assignments.)

SECTION 4

ASYNCHRONOUS  PARALLEL  AUCTION  ALGORITHMS 

4.1 INTRODUCTION

In  the  previous  Section,  we  discussed  our  designs  of  parallel  AUCTION  algorithms  for 
implementation  on  MIMD  and  SIMD  machines.  The  design  of  these  parallel  algorithms 
included  sufficient  synchronization  in  order  to  guarantee  that  the  bids  generated  by  the  parallel 
and  sequential  algorithms  were  identical.  However,  this  synchronization  often  prevents 
efficient  distribution  of  the  computational  load  across  processors,  thereby  reducing  the 
efficiency of the parallel AUCTION algorithms.

The  AUCTION  algorithm  is  a  natural  candidate  for  asynchronous  implementation,  as 
discussed  in  Appendix  A.  In  an  asynchronous  implementation,  bid  calculations  may  be  done 
with  out-of-date  object  price  information  and  the  highest  bidder  awards  and  subsequent  price 
adjustments  may  be  done  with  out-of-date  bid  information.  The  potential  advantage  of  an 
asynchronous implementation is a reduction of the so-called synchronization overhead. This
is the delay incurred when several processors synchronize to calculate in parallel a single
person bid, when several processors calculating separate person bids in parallel wait to make
sure that up-to-date price information is available, and when the processors calculating in
parallel the highest bidder awards wait for all bids to come in. Asynchronous algorithms are
discussed  in  detail  in  [28],  which  gives  many  other  references. 

In  this  section  we  explore  the  merits  of  various  asynchronous  implementations  of  the 
AUCTION  algorithm  in  a  shared  memory  MIMD  multiprocessor:  the  Encore  Multimax.  The 
validity  of  such  an  asynchronous  implementation  is  established  in  Appendix  A.  We  compare 
the  performance  of  the  synchronous  and  asynchronous  implementations  of  the  AUCTION 
algorithm, in an effort to quantify the tradeoffs between Jacobi and Gauss-Seidel
parallelization, as well as the effects of asynchronism. To our knowledge, this is the first work
to report on the practical performance of asynchronous versions of the AUCTION algorithm on
a real parallel machine.

4.2  ASYNCHRONOUS  IMPLEMENTATION  OF  THE  AUCTION 
ALGORITHM 

In  this  subsection,  we  describe  the  asynchronous  implementations  of  the  AUCTION 
algorithm  using  the  model  for  asynchronous  computation  described  in  Appendix  A.  As  in  the 
synchronous  AUCTION  algorithms,  we  describe  the  asynchronous  algorithms  in  terms  of  the 
bid  phase  and  the  auction  phase  of  each  iteration.  The  difference  between  the  synchronous  and 
asynchronous  algorithms  is  that  the  information  used  in  the  bid  and  auction  phases  may  be  out 
of  date,  as  discussed  in  Appendix  A. 

In  our  asynchronous  implementations,  the  bid  calculations  for  a  person  i  are  divided 
into  two  types  of  tasks:  search  tasks,  corresponding  to  searching  a  subset  of  the  feasible 
objects  A(i),  and  bid  tasks,  corresponding  to  merging  the  results  generated  by  the  various 
search  tasks  corresponding  to  person  i  and  generating  a  bid  for  person  i.  These  tasks  are 
organized in a first-in, first-out queue. When a processor becomes free, it starts executing the
top task of the queue if the queue is nonempty; otherwise it checks whether a termination
condition is satisfied. The algorithm stops when all processors encounter the termination
condition. Similarly to the synchronous Gauss-Seidel implementation, each set of admissible
objects A(i) is divided into k groups of objects $A_1(i)$, ..., $A_k(i)$. The calculation of the bid of a
person i is divided into k tasks, where each task involves a different group of objects. To
perform one of these tasks, a processor must calculate and store in memory the best value,
second best value, and best object within the corresponding object group.

In  addition  to  the  search  tasks,  a  bid  task  is  created  for  each  unassigned  person  i.  This 
bid  task  reads  the  results  of  the  individual  searches  stored  in  memory  and  completes  the  bid  of 
person i by merging the individual group search results, that is, by finding the best object and
bid for person i based on the currently stored group results. The bid task also includes raising
the price of the best object and changing the assignment of the object (assuming the calculated
bid is larger than the best object's price by at least ε).
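
The division of labor between search tasks and the bid task can be sketched as follows. This is a minimal sequential Python illustration under stated assumptions: the names search_task and bid_task are hypothetical, and the bid value is taken to be the standard auction bid (benefit of the best object minus the second best value, plus ε), since the bid equations of Section 3 are not reproduced in this part of the report. Each person is assumed to have at least two admissible objects, so that a second best value exists.

```python
NEG_INF = float("-inf")


def search_task(benefit_row, prices, group):
    """Scan one group of objects for one person.

    Returns (best value, second best value, best object) within the group,
    where the value of object j is benefit_row[j] - prices[j].
    """
    best, second, best_obj = NEG_INF, NEG_INF, None
    for j in group:
        value = benefit_row[j] - prices[j]
        if value > best:
            best, second, best_obj = value, best, j
        elif value > second:
            second = value
    return best, second, best_obj


def bid_task(benefit_row, group_results, prices, eps):
    """Merge stored group results and form a bid.

    Returns (object, bid) if the bid exceeds the object's current price by
    at least eps, and None if the bid is ineffective (for example because
    the stored results were computed from out-of-date prices).
    """
    best, second, best_obj = NEG_INF, NEG_INF, None
    for g_best, g_second, g_obj in group_results:
        if g_best > best:
            second = max(best, g_second)
            best, best_obj = g_best, g_obj
        else:
            second = max(second, g_best)
    if best_obj is None:
        return None
    bid = benefit_row[best_obj] - second + eps   # assumed standard auction bid
    if bid >= prices[best_obj] + eps:
        return best_obj, bid
    return None
```

With benefits [10, 7, 9, 4], prices [3, 1, 2, 0] and groups [0, 1] and [2, 3], the two searches return (7, 6, 0) and (7, 4, 2); merging gives best value 7 at object 0 with second best value 7, and with ε = 0.5 the bid of 3.5 on object 0 is accepted. If the price of object 0 has meanwhile risen to 5, the same stored results produce an ineffective bid.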

There  are  two  sources  of  asynchronism  in  this  implementation.  First,  it  is  possible  for 
some  prices  to  be  changed  between  the  time  a  search  task  is  completed  and  the  time  the  results 
of  that  task  are  used  to  calculate  a  person  bid.  Second,  it  is  possible  that  the  bid  task  of  a 
person is carried out before some of the search tasks associated with that bid are completed.
In both cases the bid may reflect out-of-date price information and may prove ineffective in that
it yields a bid that does not exceed the corresponding best object's price by at least ε. The
advantage  of  the  asynchronous  implementation  is  that  processors  do  not  remain  idle  waiting  to 
get  synchronized  with  other  processors  or  waiting  for  merging  tasks  to  be  completed. 

The  above  implementation  can  be  specialized  to  implement  asynchronous  algorithms 
which are equivalent to the Gauss-Seidel, Jacobi and Hybrid AUCTION algorithms by
controlling  the  number  of  search  tasks  generated  for  each  unassigned  person  and  the 
distribution  of  tasks  among  processors.  If  a  single  search  task  is  generated  per  unassigned 
person,  and  this  search  task  and  its  corresponding  bid  task  are  assigned  to  a  single  processor, 
the  resulting  algorithm  corresponds  to  an  asynchronous  implementation  of  the  Jacobi 
AUCTION algorithm. If the number of search tasks per unassigned bidder is equal to the
number of processors, the resulting algorithm is an asynchronous implementation of the
Gauss-Seidel AUCTION algorithm. Asynchronous hybrid variations are obtained by
modifying the ratio of the number of processors used to the number of search tasks generated
per unassigned bidder. In the following subsections, we discuss the results obtained from our
implementations of the asynchronous Jacobi AUCTION algorithm and two asynchronous
Hybrid AUCTION algorithms.

4.3 ASYNCHRONOUS JACOBI AUCTION ALGORITHM

The asynchronous Jacobi AUCTION algorithm design is aimed at reducing the overall
synchronization overhead by allowing bids to be computed based on older values of the object
prices. Specifically, processors start computing new bids without waiting for other processors
to  complete  their  price  updates.  Some  synchronization  is  still  required  to  guarantee  that  the 
prices  of  each  object  are  changed  in  an  appropriate  order,  and  to  guarantee  that  each  processor 
computes  the  bid  of  a  different  person.  This  synchronization  is  implemented  using  locks  on 
each  object  and  a  lock  on  the  queue  of  unassigned  persons;  these  locks  allow  only  one 
processor  at  a  time  to  modify  the  price  of  a  given  object,  and  only  one  processor  at  a  time  to 
update  the  queue  of  unassigned  persons.  Figure  4-1  illustrates  the  design  of  the  asynchronous 
Jacobi  AUCTION  algorithm.  In  order  to  reduce  contention  for  the  locks,  when  the  number  of 
persons  in  the  unassigned  persons  queue  is  lower  than  the  number  of  processors,  excess 
processors are diverted to a barrier to wait for a new ε-scaling cycle.
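
The locking scheme described above can be illustrated with a small multithreaded sketch. This is a simplified Python illustration and not the Encore Multimax implementation: the name async_jacobi_auction is hypothetical, the problem is assumed dense with integer benefits and at least two objects, a single fixed ε is used in place of ε-scaling (benefits are multiplied by n + 1 and ε is set to 1, so all prices remain integers), and idle threads simply yield instead of waiting at a barrier. The bid value is assumed to be the standard auction bid.

```python
import threading
import time
from collections import deque


def async_jacobi_auction(benefits, num_workers=4):
    """Return assignment[i] = object held by person i (dense n x n, n >= 2)."""
    n = len(benefits)
    scaled = [[a * (n + 1) for a in row] for row in benefits]
    eps = 1
    prices = [0] * n
    owner = [None] * n                       # owner[j] = person holding object j
    object_locks = [threading.Lock() for _ in range(n)]
    queue_lock = threading.Lock()            # lock on the unassigned persons queue
    unassigned = deque(range(n))
    in_progress = [0]                        # bids currently being computed

    def worker():
        while True:
            with queue_lock:
                if unassigned:
                    i = unassigned.popleft()
                    in_progress[0] += 1
                elif in_progress[0] == 0:
                    return                   # nothing queued, nothing pending
                else:
                    i = None
            if i is None:
                time.sleep(0)                # yield and look again
                continue
            # Bid phase: prices are read without locks and may be out of date.
            margins = [scaled[i][j] - prices[j] for j in range(n)]
            v = max(margins)
            j = margins.index(v)
            w = max(m for k, m in enumerate(margins) if k != j)
            bid = scaled[i][j] - w + eps
            # Assignment phase: only one thread at a time may change object j.
            with object_locks[j]:
                if bid >= prices[j] + eps:
                    displaced = owner[j]
                    prices[j], owner[j] = bid, i
                else:
                    displaced = i            # ineffective bid, person i retries
            with queue_lock:
                if displaced is not None:
                    unassigned.append(displaced)
                in_progress[0] -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assignment = [None] * n
    for j, i in enumerate(owner):
        assignment[i] = j
    return assignment
```

The particular assignment returned may differ from run to run, since it depends on the order in which threads acquire the locks, but under the assumptions above the total benefit is the same in every run.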


Figure 4-1. Design of Asynchronous Jacobi AUCTION algorithm. Locks on each
object and on the unassigned persons queue are used to guarantee data
integrity and preserve complementary slackness.


The performance of the asynchronous Jacobi AUCTION algorithm is illustrated in Fig.
4-2. The numbers shown represent an average of three runs; the actual running time of the
algorithm varies from run to run because the order in which different processors complete their
bids and acquire the locks affects the order in which persons are inserted into the unassigned
persons queue. A different ordering of persons produces a different auction process, which
affects  the  total  computation  time.  The  curves  in  Fig.  4-2  represent  the  total  computation  time, 
and  the  number  of  bidding  iterations  performed  by  the  parent  process.  Note  the  close 
correlation  between  these  two  curves,  indicating  a  minimal  amount  of  synchronization 
overhead.  Note  furthermore  that  the  computation  times  are  reduced  to  nearly  7.4  seconds, 
which  represents  a  28%  improvement  over  the  minimum  times  achieved  by  the  synchronous 
Jacobi  AUCTION  algorithm  in  Section  3.2.2.  This  improvement  is  achieved  because  of  the 
improved  load  balance  among  processors,  as  processors  do  not  wait  idly  for  other  processors 
to  complete  their  bidding  process.  Note  that  there  is  no  apparent  slowdown  of  the  achievable 
processor  efficiency  with  increased  number  of  processors,  unlike  the  performance  of  the 
synchronous  Jacobi  AUCTION  algorithm. 


Figure 4-2. Performance of the asynchronous Jacobi AUCTION algorithm for
1000 person, 20% dense assignment problem, benefit range [1,1000].
The number of iterations by the parent processor is also indicated.

4.4 ASYNCHRONOUS HYBRID AUCTION ALGORITHMS

The results obtained using the asynchronous Jacobi AUCTION algorithm indicate that
reducing the synchronization per iteration can improve the performance of the parallel
AUCTION algorithms. The designs of the asynchronous Hybrid AUCTION algorithms were
aimed  at  developing  asynchronous  algorithms  which  effectively  combined  the  speedups  of 
Jacobi  and  Gauss-Seidel  parallelization.  The  designs  of  the  two  asynchronous  algorithms 
differ  slightly,  and  follow  closely  the  theory  of  the  asynchronous  AUCTION  algorithm 
presented  in  Appendix  A. 

Figure 4-3 illustrates the design of the asynchronous Hybrid AUCTION algorithms.
Instead of an unassigned person queue, there is a queue of unassigned search tasks and bid
tasks. Each unassigned person is represented by S search tasks and one bid task in this queue,
ordered consecutively in the queue, so that the bid task follows the S search tasks. Different
types of asynchronous algorithms can be generated by controlling the number of search tasks
generated for each unassigned person. As before, a synchronization lock is required to allow
tasks to be read and generated one at a time.

Figure  4-3  illustrates  the  processing  of  a  single  processor.  After  reading  a  task  from 
the  task  queue,  the  processor  determines  whether  it  is  a  search  task  or  a  bid  task.  If  it  is  a 
search task for bidder i, the processor searches the appropriate segment of the objects A(i) and
writes a message in shared memory with the results of its search (the two highest net profit
levels and the object offering the highest net profit). The message is protected by a lock
indexed by the processor index and the person index, which guarantees that the message must
be read in its entirety by the bid task. After writing the message, the processor releases the
lock and attempts to acquire another task.

If the task acquired is a bid task, the processor must read the messages left by the
search tasks for this person. Some of these search tasks may still be in process, so the bid
processor may be reading old messages. The processor locks each message, reads the
contents, releases the lock and merges the results of the individual search tasks into an overall
search result. This is then used to compute a bid (from person i to object j). The processor
then locks object j, updates the price and assignment of object j and releases the object. If an
unassigned  person  results  from  this  operation,  the  processor  then  locks  the  unassigned  task 
queue,  inserts  S  search  tasks  and  one  bid  task  at  the  end  of  the  queue  for  the  unassigned 
person,  and  releases  the  queue. 


Figure 4-3. Design of the asynchronous Hybrid AUCTION algorithm.

The algorithm described above is the asynchronous Hybrid AUCTION II algorithm.
The difficulty with this algorithm is that a bid is often computed based on outdated messages,
leading to a large increase in the number of losing bids (and therefore the number of iterations
required for convergence). Ideally, the bid task for person i would wait for the search tasks for
person i to be completed; however, this requires synchronization. In the Hybrid AUCTION I
algorithm, the processor that acquires the last search task corresponding to a person also
acquires the bid task corresponding to that person. This processor executes the search task
first, then the bid task. In this manner, the likelihood that the other search tasks corresponding
to that person are complete by the time the bid task is executed is substantially increased.
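
The difference between the two variants lies only in how tasks are taken from the queue, which the following sketch tries to make concrete. It is a minimal Python illustration; the tuple layout of a task and the names tasks_for_person and take_tasks are assumptions made for the example. In a real implementation, both insertion and removal would be performed while holding the lock on the task queue.

```python
from collections import deque


def tasks_for_person(person, num_search):
    """S search tasks followed by one bid task, in queue order."""
    searches = [("search", person, s) for s in range(num_search)]
    return searches + [("bid", person, None)]


def take_tasks(queue, num_search, variant):
    """Remove the task(s) one processor obtains from the head of the queue.

    Variant "II": every task goes to the first available processor.
    Variant "I": the processor that takes a person's last search task also
    takes the bid task that immediately follows it in the queue.
    """
    if not queue:
        return []
    taken = [queue.popleft()]
    kind, _person, segment = taken[0]
    if variant == "I" and kind == "search" and segment == num_search - 1:
        taken.append(queue.popleft())
    return taken
```

With two search tasks per person and persons 0 and 1 queued, successive calls under variant "I" return the first search task of person 0 alone, then the second search task of person 0 together with the bid task of person 0. Under variant "II" the same bid task is handed out separately, possibly to a processor that reaches it while the search tasks are still running.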

Figure 4-4 illustrates the performance of one variation of the asynchronous Hybrid
AUCTION I algorithm for the same 1000 person, 20% dense assignment problem. In this
variation, the number of search tasks generated per person is equal to half the total number of
processors used. In this manner, the results are comparable to the results obtained using the
synchronous Hybrid AUCTION algorithm with two bids computed simultaneously. As
before,  the  number  of  iterations  required  for  convergence  depends  on  the  order  in  which  the 
processors  complete  their  tasks,  and  varies  between  different  executions  of  the  algorithm.  The 
times  shown  are  the  average  times  of  three  runs.  Contrasting  the  results  of  Fig.  4-4  with  those 
of  Fig.  3-12,  we  see  that  the  asynchronous  Hybrid  AUCTION  I  algorithm  achieves  nearly  a 
30%  reduction  in  computation  time  over  the  corresponding  synchronous  Hybrid  AUCTION 
algorithm. Notice that the minimal times of both curves occur around 10 processors; adding
further processors increases the overhead for merging the results of additional searches,
thereby detracting from overall performance in both the synchronous and asynchronous cases.
The computation reduction of the asynchronous algorithm is due to improvements in load
balancing and reduced synchronization overhead. Load balance among processors is improved
by having search tasks conducted in parallel with bid tasks, thereby keeping the majority of the
processors performing useful computations. Reduced synchronization is accomplished by
removing global synchronization primitives such as barriers and monitors, replacing
these by locks on the specific data items (such as messages) for which integrity must be
maintained. 

Figure 4-4. Average computation time of asynchronous Hybrid AUCTION I for
1000 person, 20% dense assignment problem, benefit range [1,1000].
The times shown are the average of three different runs on the Encore
Multimax. In these problems, the number of search tasks per bid was
equal to 1/2 the number of processors. (Horizontal axis: number of
processors.)

Figure 4-5 illustrates the performance of the asynchronous Hybrid AUCTION I
algorithm using 16 total processors as the numbers of bid and search tasks are varied. The
goal of the Hybrid AUCTION I algorithm is to obtain a multiplicative combination of Jacobi
and Gauss-Seidel speedups; the results of Fig. 4-5 indicate that the asynchronous Hybrid
AUCTION I algorithm has approached close to a multiplicative combination for the optimal
choices of numbers of processors and search tasks. There is a noticeable dropoff in the
combined effectiveness when large numbers of search tasks are generated for each bid. The
reason for this dropoff is that the total synchronization overhead associated with each iteration
increases because the overall length of the task queue increases; this length is equal to the
number of search tasks per bid times the number of bids required for the algorithm to
converge, and thus grows linearly with the number of search tasks. Since synchronization
(using locks) is used to maintain the integrity of the task queue, the synchronization overhead
increases as the number of search tasks per bid increases for a fixed number of processors.

Figure 4-5. Comparison of predicted and actual speedups achieved by the
asynchronous Hybrid AUCTION I algorithm for 1000 person,
dense assignment problem, benefit range [1,1000]. (Horizontal axis:
number of search tasks per bid.)

As the results of Fig. 4-5 indicate, the asynchronous Hybrid AUCTION I algorithm
approached a successful combination of the speedups possible from Jacobi and Gauss-Seidel
parallelism through a careful management of the order in which tasks are selected for
processing. In order to illustrate the effects of more general asynchronous implementations,
we designed the asynchronous Hybrid AUCTION II algorithm, which was identical to the
Hybrid AUCTION I algorithm except that the bid tasks would be assigned to the first available
processor after all the corresponding search tasks for that bid had been assigned (as opposed
to assignment to the same processor which selected the last search task). Figure 4-6 illustrates
the relative performance (averaged across three runs) of the asynchronous Hybrid AUCTION I
and II algorithms for the same 1000 person assignment problem. In these experiments,
two search tasks per bid are generated. Clearly, the asynchronous Hybrid AUCTION II
algorithm is nearly twice as slow as the asynchronous Hybrid AUCTION I algorithm. The
reason for this behavior is illustrated in Fig. 4-7, which describes the number of bids generated
by each algorithm for convergence to an optimal solution. In essence, the number of bids
required more than doubles for the asynchronous Hybrid AUCTION II algorithm. This is
because the bid task is generating the bids before the search tasks have completed their scans;
these bids based on old information are often rejected, so that additional bids are required. The
results illustrate the importance of careful management of asynchronous tasks in order to
guarantee that the processors are doing useful work (i.e., work that will not become irrelevant
when new information is acquired).

Figure 4-6. Performance of different asynchronous Hybrid AUCTION algorithms
for 1000 person, 20% dense assignment problems, benefit range
[1,1000].

Figure  4-8  compares  the  performance  of  the  synchronous  Hybrid  AUCTION  algorithm 
with  the  performance  of  the  asynchronous  Hybrid  AUCTION  I  algorithm  for  1000  person 
assignment  problems  with  varying  feasible  assignment  density.  In  these  experiments,  the 
number  of  search  tasks  generated  per  unassigned  bidder  was  equal  to  half  the  number  of 
processors  selected;  on  the  average,  only  two  simultaneous  bids  were  computed  by  the 
asynchronous  algorithm,  making  it  comparable  to  the  synchronous  Hybrid  AUCTION 
algorithm.  The  computation  times  of  the  asynchronous  algorithm  are  averaged  across  3  runs. 
Note the significant reduction in computation time achieved by the asynchronous algorithm;
this  improvement  reflects  the  improvement  in  load  balancing  across  the  multiple  processors 
used. 


Figure 4-8. Performance of synchronous Hybrid AUCTION and asynchronous
Hybrid AUCTION I algorithms on 1000 person assignment problems
of varying density, benefit range [1,1000].

APPENDIX A

THE  AUCTION  ALGORITHM 

In this Appendix, we overview the theory of the AUCTION algorithm, describe a
model  for  an  asynchronous  variation  of  the  algorithm,  and  establish  that  this  asynchronous 
variation  obtains  an  optimal  solution  to  the  assignment  problem. 

A.l  THE  AUCTION  ALGORITHM  FOR  ASSIGNMENT  PROBLEMS 

In the assignment problem, $n$ persons wish to allocate among themselves $n$ objects, on
a one-to-one basis. Each person $i$ must select his object from a given subset $A(i)$. There is a
given benefit $a_{ij}$ that $i$ associates with each object $j$ in $A(i)$. An assignment is a set of $k$
person-object pairs $(i_1, j_1), \ldots, (i_k, j_k)$, such that $0 \le k \le n$, $j_m \in A(i_m)$ for all
$m = 1, \ldots, k$, and the persons $i_1, \ldots, i_k$ and objects $j_1, \ldots, j_k$ are all distinct. The
total benefit $B$ of the assignment is the sum of the benefits of the assigned pairs:

$$B = \sum_{m=1}^{k} a_{i_m j_m}$$

An assignment is called complete (or incomplete) if it contains $k = n$ (or $k < n$,
respectively) person-object pairs. We want to find a complete assignment with maximum total
benefit, assuming that there exists at least one complete assignment. This is the classical
assignment problem, studied algorithmically by many authors [A.1, A.2, A.3, A.4, A.5, A.6,
A.7, A.8, A.9, A.10, A.11, A.12, A.13, A.14, A.15], beginning with Kuhn's Hungarian
method [A.16].
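
The definitions above can be illustrated with a minimal Python sketch. It is not taken from the implementation discussed in this report: the function names `total_benefit` and `is_complete`, the 0-based indexing, and the representation of the benefits as a list of dictionaries (so that the keys of `benefit[i]` play the role of $A(i)$) are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

Benefit = List[Dict[int, float]]  # benefit[i][j] = a_ij for each j in A(i)


def total_benefit(benefit: Benefit, pairs: List[Tuple[int, int]]) -> float:
    """Return B, the sum of a_ij over the assigned pairs, after checking feasibility."""
    persons = [i for i, _ in pairs]
    objects = [j for _, j in pairs]
    # Persons and objects appearing in an assignment must all be distinct.
    if len(set(persons)) != len(persons) or len(set(objects)) != len(objects):
        raise ValueError("persons and objects must all be distinct")
    # Each assigned object must belong to the feasible set A(i) of its person.
    for i, j in pairs:
        if j not in benefit[i]:
            raise ValueError(f"object {j} is not in A({i})")
    return sum(benefit[i][j] for i, j in pairs)


def is_complete(benefit: Benefit, pairs: List[Tuple[int, int]]) -> bool:
    """An assignment is complete when it contains k = n person-object pairs."""
    return len(pairs) == len(benefit)
```

For example, with `benefit = [{0: 5, 1: 3}, {1: 4}]`, the pair list `[(0, 0), (1, 1)]` is a complete assignment, while `[(1, 0)]` is rejected because object 0 is not feasible for person 1.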

In the AUCTION algorithm, each object $j$ has a price $p_j$ with the initial prices being
arbitrary. Prices are adjusted upwards as persons "bid" for their "best" object, that is, the
object for which the corresponding benefit minus the price is maximal. Only persons without
an object submit a bid, and objects are awarded to their highest bidder. In particular, the prices
$p_j$ are adjusted at the end of "bidding" iterations. At the beginning of each iteration, we have a
set of object prices and an incomplete assignment, and the algorithm terminates when a
complete assignment is obtained. Each iteration involves a subset $I$ of the persons that are
unassigned at the beginning of the iteration. It has two phases:

Bidding Phase: Each person $i \in I$ determines an object $j_i \in A(i)$ for which $a_{ij} - p_j$
is maximized over $j$, i.e.

$$j_i = \arg\max_{j \in A(i)} (a_{ij} - p_j)$$

and submits a bid $p_{j_i} + \gamma_i$ for this object, where $\gamma_i$ is a positive bidding increment to
be specified shortly.

Assignment Phase: Each object $j$ that receives one or more bids determines the
highest of these bids, increases $p_j$ to the highest bid, and gets assigned to the
person who submitted the highest bid. The person that was assigned to $j$ at the
beginning of the iteration (if any) is now left without an object (and becomes
eligible to bid at the next iteration). If an object does not receive any bid during an
iteration, its price and assignment status are left unchanged.

It can be shown that if the bidding increments $\gamma_i$ are bounded from below by some
$\epsilon > 0$, this auction process terminates in a finite number of iterations with all persons having
an object. To get a sense of this, note that if an object receives a bid in $m$ iterations, its price
must exceed its initial price by at least $m\epsilon$, while if an object is unassigned, its price has not
yet changed from its initial value. Thus, for sufficiently large $m$, the object will become
"expensive" enough to be judged "inferior" to some unassigned object by each person. It
follows that there is a bounded number of iterations at which an object can be considered best
and thus be preferred to all unassigned objects by some person. (This argument, as stated,
assumes that it is feasible to assign any person to any object, but it can be generalized for the
case where the set of feasible person-object pairs is limited, as long as there exists at least one
feasible assignment; see e.g. [A.17, A.18].)

Whether the complete assignment obtained upon termination of the auction process is
optimal depends strongly on the method for choosing the bidding increments $\gamma_i$. In a real
auction, a prudent bidder would not place an excessively high bid for fear the object might be
won at an unnecessarily high price. Consistent with this intuition, one can show that if the
bidding increment $\gamma_i$ is small enough to ensure that even after the bid is accepted, the object
will be "almost best" for the bidder, then the final assignment will be "almost optimal". In
particular, we can show that if upon termination, we have

$$\max_{j \in A(i)} (a_{ij} - p_j - \epsilon) \le a_{i j_i} - p_{j_i} \quad \text{for all assigned pairs } (i, j_i) \tag{A-1}$$

(a property known as $\epsilon$-complementary slackness or $\epsilon$-CS for short), then the total benefit of
the final assignment is within $n\epsilon$ of being optimal. For a first principles derivation of this, note
that the total benefit of any complete assignment $\{(i, j_i),\ i = 1, \ldots, n\}$ satisfies

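$$\sum_{i=1}^{n} a_{i j_i} \le \sum_{j=1}^{n} p_j + \sum_{i=1}^{n} \max_{j \in A(i)} (a_{ij} - p_j)$$

for any set of prices $p_j$, $j = 1, \ldots, n$, since

$$\sum_{i=1}^{n} \max_{j \in A(i)} (a_{ij} - p_j) \ge \sum_{i=1}^{n} (a_{i j_i} - p_{j_i})$$

and

$$\sum_{i=1}^{n} p_{j_i} = \sum_{j=1}^{n} p_j$$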


Therefore, the optimal total assignment benefit cannot exceed the quantity

$$A^* = \min_{p_j,\ j = 1, \ldots, n} \left\{ \sum_{j=1}^{n} p_j + \sum_{i=1}^{n} \max_{j \in A(i)} (a_{ij} - p_j) \right\} \tag{A-2}$$


On the other hand, if the $\epsilon$-CS property (A-1) holds upon termination of the auction process,
then by adding Eq. (A-1) over all $i$, we see that

$$\sum_{i=1}^{n} p_{j_i} + \sum_{i=1}^{n} \max_{j \in A(i)} (a_{ij} - p_j) \le \sum_{i=1}^{n} a_{i j_i} + n\epsilon \tag{A-3}$$


Since the left side above cannot be less than $A^*$, which, as argued earlier, cannot be less than
the optimal total assignment benefit, we see that the final total assignment benefit is within $n\epsilon$
of being optimal. We note parenthetically that the preceding derivation is guided by duality
theory; the assignment problem can be formulated as a linear programming problem, and the
minimization problem in the right side of Eq. (A-2) is a dual problem (see e.g. [A.18, A.19]).
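
The bound used in this derivation can be checked numerically. The sketch below is illustrative only and makes several assumptions that are not part of the report: the names `dual_value` and `satisfies_eps_cs`, 0-based indices, and benefits stored as a list of dictionaries whose keys stand for $A(i)$. `dual_value` evaluates the price-dependent upper bound on the total benefit, and `satisfies_eps_cs` tests the $\epsilon$-CS condition for a given assignment.

```python
from typing import Dict, List

Benefit = List[Dict[int, float]]  # benefit[i][j] = a_ij for each j in A(i)


def dual_value(benefit: Benefit, prices: List[float]) -> float:
    """Sum of all prices plus, for each person, the best value max_j (a_ij - p_j)."""
    best_values = (max(a - prices[j] for j, a in row.items()) for row in benefit)
    return sum(prices) + sum(best_values)


def satisfies_eps_cs(benefit: Benefit, prices: List[float],
                     assigned: Dict[int, int], eps: float) -> bool:
    """Check max_j (a_ij - p_j) - eps <= a_{i j_i} - p_{j_i} for every assigned pair."""
    for i, ji in assigned.items():
        best = max(a - prices[j] for j, a in benefit[i].items())
        if best - eps > benefit[i][ji] - prices[ji]:
            return False
    return True


# Two persons, two objects; the best complete assignment has total benefit 17.
benefit = [{0: 10, 1: 6}, {0: 8, 1: 7}]
assignment = {0: 0, 1: 1}
bound_loose = dual_value(benefit, [0.0, 0.0])   # an upper bound on 17
bound_tight = dual_value(benefit, [3.0, 0.0])   # these prices close the gap
```

With prices `[3.0, 0.0]` the assignment satisfies the condition with `eps = 0`, and the bound coincides with the total benefit, which is the duality relationship noted above.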

Suppose now that the benefits $a_{ij}$ are all integer, which is the typical practical case (if the
$a_{ij}$ are rational, they can be scaled up to integer by multiplication with a suitable common
positive integer). Then, the total benefit of any assignment is integer, so if $n\epsilon < 1$, a complete
assignment that is within $n\epsilon$ of being optimal must be optimal. It follows that if $\epsilon < 1/n$, the
benefits $a_{ij}$ are all integer, and the $\epsilon$-CS condition (A-1) is satisfied upon termination, then the
assignment obtained is optimal.

There is a standard method for choosing the bidding increments $\gamma_i$ so as to maintain the
$\epsilon$-CS condition (A-1) throughout the auction process, assuming this condition is satisfied by
the initial prices and the initial assignment (as is trivially the case when no objects are assigned
initially). In this method, $\epsilon$ is a fixed positive number, and the bidding increment $\gamma_i$ is given by

$$\gamma_i = \epsilon + v_i - w_i \tag{A-4}$$

where $v_i$ is the best object value,

$$v_i = \max_{j \in A(i)} (a_{ij} - p_j) \tag{A-5}$$

and $w_i$ is the "second best" object value

$$w_i = \max_{j \in A(i),\ j \ne j_i} (a_{ij} - p_j) \tag{A-6}$$

where $j_i$ is a best object for which the maximum in Eq. (A-5) is attained. We will assume for
convenience throughout that $A(i)$ contains at least two objects, so the maximum in Eq. (A-6) is
well defined.
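
The bidding and assignment phases, together with the bidding increment just described, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments in this report: it processes one unassigned person per iteration (a special case of the subset $I$), starts from zero prices with no objects assigned, uses 0-based indices, and stores the benefits as a list of dictionaries whose keys stand for $A(i)$. The function name `auction` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Benefit = List[Dict[int, float]]  # benefit[i][j] = a_ij for each j in A(i)


def auction(benefit: Benefit, eps: float) -> Tuple[List[Optional[int]], List[float]]:
    """Run the auction with one bidder per iteration; return (holds, prices).

    Assumes every A(i) has at least two objects and that a complete
    assignment exists; otherwise the loop below may not terminate.
    """
    n = len(benefit)
    prices = [0.0] * n                        # initial prices are arbitrary; zero here
    owner: List[Optional[int]] = [None] * n   # owner[j] = person assigned to object j
    holds: List[Optional[int]] = [None] * n   # holds[i] = object assigned to person i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        # Bidding phase: rank the objects in A(i) by value a_ij - p_j.
        ranked = sorted(((a - prices[j], j) for j, a in benefit[i].items()), reverse=True)
        (v, best), (w, _) = ranked[0], ranked[1]   # best and second best values
        gamma = eps + v - w                        # bidding increment eps + v_i - w_i
        # Assignment phase: raise the price to the bid and displace the previous owner.
        prices[best] += gamma
        previous = owner[best]
        if previous is not None:
            holds[previous] = None
            unassigned.append(previous)
        owner[best], holds[i] = i, best
    return holds, prices


benefit = [{0: 10, 1: 6, 2: 3}, {0: 8, 1: 7, 2: 2}, {0: 9, 1: 5, 2: 4}]
holds, prices = auction(benefit, eps=0.25)    # eps < 1/n with n = 3
```

Because the benefits in this example are integers and `eps` is below $1/n$, the argument given earlier implies that the returned assignment is optimal. Sorting all values is used here for brevity; a single pass that tracks the two largest values would do the same work in linear time.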


A.2 COMPUTATIONAL ASPECTS -- ε-SCALING

The AUCTION algorithm exhibits interesting computational behavior and it is essential
to understand this behavior in order to implement the algorithm efficiently. We first note that
the amount of work to solve the problem can depend strongly on the value of $\epsilon$ and on the
maximum absolute object value

$$C = \max_{i,j} |a_{ij}| \tag{A-7}$$

Basically, for many types of problems, the number of bidding iterations up to termination tends
to be proportional to $C/\epsilon$. We note also that there is a dependence on the initial prices; if these
prices are "near optimal", it can be expected that the number of iterations to solve the problem
will be relatively small. This suggests the idea of $\epsilon$-scaling, which consists of applying the
algorithm several times, starting with a large value of $\epsilon$ and successively reducing $\epsilon$ up to an
ultimate value which is less than the critical value $1/n$. Each application of the algorithm
provides good initial prices for the next application.

In practice, it is a good idea to at least consider scaling. For sparse assignment
problems, that is, problems where the set of feasible assignment pairs is severely restricted,
scaling seems almost universally helpful. This was established experimentally at the time of
the original proposal of the AUCTION algorithm [A.20]. There is also a related polynomial
complexity analysis [A.18] that uses some of the earlier ideas of an $\epsilon$-scaling analysis [A.9]
for the $\epsilon$-relaxation method of [A.21].

Our implementation of $\epsilon$-scaling is as follows: the integer benefits $a_{ij}$ are first
multiplied by $n+1$ and the AUCTION algorithm is applied with progressively lower values of $\epsilon$,
up to the point where $\epsilon$ becomes 1 or smaller (because the $a_{ij}$ have been scaled by $n+1$, it is
sufficient for optimality of the final assignment to have $\epsilon \le 1$). The sequence of $\epsilon$ values used is

$$\epsilon(k) = \max\{1, \Delta/\theta^k\}, \quad k = 0, 1, \ldots$$

where $\Delta > 0$ and $\theta > 1$ are parameters set by the user. Typical values that we used for sparse
problems are $\Delta = C/4$ or $\Delta = C/2$, and $1 < \theta < 5$.
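
The scaling procedure above can be sketched as follows. This is an illustrative Python outline, not the code used in this report. The names `eps_schedule`, `auction_phase` and `auction_with_scaling` are hypothetical, indices are 0-based, and the benefits are stored as a list of dictionaries whose keys stand for $A(i)$. The sketch also assumes that each application of the algorithm starts with no objects assigned and only inherits the prices of the previous application; the text does not state how the assignment is carried over.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Benefit = List[Dict[int, float]]  # benefit[i][j] = a_ij for each j in A(i)


def eps_schedule(delta: float, theta: float) -> Iterator[float]:
    """Yield eps(k) = max(1, delta / theta**k) for k = 0, 1, ... until eps reaches 1."""
    if delta <= 0 or theta <= 1:
        raise ValueError("need delta > 0 and theta > 1")
    k = 0
    while True:
        eps = max(1.0, delta / theta ** k)
        yield eps
        if eps == 1.0:
            return
        k += 1


def auction_phase(benefit: Benefit, eps: float,
                  prices: List[float]) -> Tuple[List[Optional[int]], List[float]]:
    """One application of the auction (one bidder at a time) from the given prices."""
    n = len(benefit)
    prices = list(prices)
    owner: List[Optional[int]] = [None] * n   # owner[j] = person assigned to object j
    holds: List[Optional[int]] = [None] * n   # holds[i] = object assigned to person i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        ranked = sorted(((a - prices[j], j) for j, a in benefit[i].items()), reverse=True)
        (v, best), (w, _) = ranked[0], ranked[1]
        prices[best] += eps + v - w           # bid = price + increment
        if owner[best] is not None:
            holds[owner[best]] = None
            unassigned.append(owner[best])
        owner[best], holds[i] = i, best
    return holds, prices


def auction_with_scaling(benefit: Benefit, delta: float, theta: float) -> List[Optional[int]]:
    """Scale integer benefits by n + 1, then run the auction for each eps(k) in turn."""
    n = len(benefit)
    scaled = [{j: a * (n + 1) for j, a in row.items()} for row in benefit]
    prices = [0.0] * n
    holds: List[Optional[int]] = [None] * n
    for eps in eps_schedule(delta, theta):
        holds, prices = auction_phase(scaled, eps, prices)  # reuse the final prices
    return holds
```

For instance, `list(eps_schedule(100, 4))` gives the values 100, 25, 6.25, 1.5625 and 1, so five applications of the algorithm would be run. The parameters `delta` and `theta` are left to the caller, as in the text.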
A.3 THE TOTALLY ASYNCHRONOUS VERSION OF THE AUCTION ALGORITHM

One may view a synchronous parallel algorithm as a sequence of consecutive
computation segments called phases. The computations within each phase are divided in some
way among the processors of a parallel computing system. The computations of any two
processors within each phase are independent, so the algorithm is mathematically equivalent to
some serial algorithm. Phases are separated by synchronization points, which are times at
which all processors have completed the computations of a given phase but no processor has
yet started the computations of the next phase. In asynchronous parallel algorithms, the
coordination of the computations of the processors is less strict. Processors are allowed to
proceed with computations of a phase with data which may be out-of-date because the
computations of the previous phase are incomplete. An asynchronous algorithm may contain
some synchronization points but these are generally fewer than the ones of the corresponding
synchronous version.

To get a first idea of the totally asynchronous implementation of the AUCTION
algorithm, it is useful to think of a person as an autonomous decision maker that obtains at
unpredictable times information about the prices of the objects. Each unassigned person makes
a bid at arbitrary times on the basis of its current object price information (that may be outdated
because of communication delays). Furthermore, assignment of objects may be decided even
if some potential bidders have not been heard from. There are basically two conditions that
must be observed in order for this process to terminate properly. We state roughly these
conditions below and we will give a more precise formulation shortly.

1. An unassigned person will bid for some object within finite time, and cannot bid
twice (i.e., cannot bid for a second object while waiting for a reply regarding the
disposition of an earlier bid for another object).

2. Whenever one or more bids are received that raise the price of an object, then,
within finite time, its price must be updated, and its value must be communicated
(not necessarily simultaneously) to all persons. Furthermore, a person that has lost
its assigned object must be informed within finite time of the change in its assignment.
We now formulate the totally asynchronous model of the AUCTION algorithm, and we
prove its validity. We denote

$p_j(t)$ = Price of object $j$ at time $t$

$r_j(t)$ = Person assigned to object $j$ at time $t$ [$r_j(t) = 0$ if object $j$ is unassigned]

$U(t)$ = Set of unassigned persons at time $t$ [$i \in U(t)$ if $r_j(t) \ne i$ for all objects $j$]

We assume that $U(t)$, $p_j(t)$ and $r_j(t)$ can change only at integer times $t$; this involves no
loss of generality, since $t$ may be viewed as the index of a sequence of physical times. In
addition to $U(t)$, $p_j(t)$ and $r_j(t)$, the algorithm maintains at each time $t$ a subset $R(t) \subset U(t)$ of
unassigned persons that may be viewed as having a "ready bid" at time $t$. We assume that by
time $t$, a person $i \in R(t)$ has used prices $p_j(t_{ij}(t))$ from some earlier times $t_{ij}(t) \le t$ to compute
the best value

$$v_i(t) = \max_{j \in A(i)} \bigl(a_{ij} - p_j(t_{ij}(t))\bigr) \tag{A-8}$$

a best object $j_i(t)$ attaining the above maximum,

$$j_i(t) = \arg\max_{j \in A(i)} \bigl(a_{ij} - p_j(t_{ij}(t))\bigr) \tag{A-9}$$

the second best value

$$w_i(t) = \max_{j \in A(i),\ j \ne j_i(t)} \bigl(a_{ij} - p_j(t_{ij}(t))\bigr) \tag{A-10}$$

and has determined a bid

$$b_i(t) = p_{j_i(t)}\bigl(t_{i j_i(t)}(t)\bigr) + v_i(t) - w_i(t) + \epsilon \tag{A-11}$$

The implication here is that unassigned persons $i$ will enter the set $R(t)$ and become eligible to
bid after some computations, which update $j_i(t)$ and $b_i(t)$. However, to maximize the generality
and flexibility of our model, the precise mechanism by which these computations are done is
left unspecified, subject to the following two assumptions:

Assumption 1: $U(t) \ne \emptyset$ implies $R(t') \ne \emptyset$ for some $t' \ge t$.

Assumption 2: For all $i$, $j$, and $t$, $\lim_{t \to \infty} t_{ij}(t) = \infty$.
Clearly an asynchronous AUCTION algorithm cannot solve the problem if unassigned
persons stop submitting bids and if old information is not eventually discarded. This is the
motivation for the preceding two assumptions. Initially, each person is assigned to at most
one object, that is, $r_j(0) \ne r_{j'}(0)$ for all assigned objects $j$ and $j'$ with $j \ne j'$, and it will be
seen that the algorithm preserves this property throughout its course. Furthermore, initially
$\epsilon$-CS holds, that is,

$$\max_{k \in A(i)} (a_{ik} - p_k(0) - \epsilon) \le a_{ij} - p_j(0) \quad \text{if } i = r_j(0)$$

It will be shown shortly that this property is also preserved during the algorithm.

At each time $t$, if all persons are assigned [$U(t) = \emptyset$], the algorithm terminates.
Otherwise, if $R(t) = \emptyset$, nothing happens. If $R(t)$ is nonempty, the following occur:

1. A nonempty subset $I(t) \subset R(t)$ of persons that have a bid ready is selected.

2. Each object $j$ for which the corresponding bidder set

$$B_j(t) = \{ i \in I(t) \mid j = j_i(t) \} \tag{A-12}$$

is nonempty determines the highest bid

$$b_j(t) = \max_{i \in B_j(t)} b_i(t) \tag{A-13}$$

and a person $i_j(t)$ for which the above maximum is attained:

$$i_j(t) = \arg\max_{i \in B_j(t)} b_i(t) \tag{A-14}$$

Then, the pair $[p_j(t), r_j(t)]$ is changed according to

$$[p_j(t+1), r_j(t+1)] = \begin{cases} [b_j(t), i_j(t)] & \text{if } b_j(t) \ge p_j(t) + \epsilon \\ [p_j(t), r_j(t)] & \text{otherwise} \end{cases} \tag{A-15}$$

Note that if $t_{ij}(t) = t$ and $U(t) = R(t)$ for all $t$, then the asynchronous algorithm is equivalent to
the synchronous version described in Section A.1.
The asynchronous model becomes relevant in a parallel computation context where
some processors compute bids for some unassigned persons, while other processors
simultaneously update some of the object prices and corresponding assigned persons.
Suppose that a single processor calculates a bid of person $i$ by using the values $a_{ij} - p_j(t_{ij}(t))$
prevailing at times $t_{ij}(t)$ and then calculates the maximum value at time $t$. Then, if the price of
an object $j \in A(i)$ is updated between times $t_{ij}(t)$ and $t$ by some other processor, the maximum
value will be based on out-of-date information. The asynchronous algorithm models this
possibility by allowing $t_{ij}(t) < t$. A similar situation arises when the bid of person $i$ is
calculated cooperatively by several processors rather than by a single processor.
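
The effect of out-of-date prices can be explored with a small serial simulation of the asynchronous model. The Python sketch below is illustrative only and rests on assumptions that go beyond the text: the function name `async_auction`, the random choice of the bidder subset $I(t)$, and the rule that each person refreshes its local copy of a price at random but no later than `max_delay` time steps after the last read (a bounded delay, which is stronger than what Assumption 2 requires). Every unassigned person is treated as having a ready bid, indices are 0-based, and `None` stands for the value 0 used in the text to mark an unassigned object.

```python
import random
from typing import Dict, List, Optional, Tuple

Benefit = List[Dict[int, float]]  # benefit[i][j] = a_ij for each j in A(i)


def async_auction(benefit: Benefit, eps: float, max_delay: int = 3,
                  seed: int = 0) -> Tuple[List[Optional[int]], List[float], int]:
    """Simulate the asynchronous auction; return (holds, prices, number of steps)."""
    rng = random.Random(seed)
    n = len(benefit)
    prices = [0.0] * n                        # p_j(t)
    owner: List[Optional[int]] = [None] * n   # r_j(t), None if object j is unassigned
    holds: List[Optional[int]] = [None] * n   # holds[i] = object assigned to person i
    # seen[i][j] = (price of j last read by person i, time of that read)
    seen = [{j: (0.0, 0) for j in row} for row in benefit]
    t = 0
    while None in holds:
        # Communication: some price copies are refreshed, stale ones are forced to.
        for i in range(n):
            for j in list(seen[i]):
                if t - seen[i][j][1] >= max_delay or rng.random() < 0.5:
                    seen[i][j] = (prices[j], t)
        unassigned = [i for i in range(n) if holds[i] is None]
        bidders = [i for i in unassigned if rng.random() < 0.5] or [rng.choice(unassigned)]
        top: Dict[int, Tuple[float, int]] = {}    # object -> (highest bid, bidder)
        for i in bidders:
            # Bid computed from possibly outdated prices.
            ranked = sorted(((benefit[i][j] - seen[i][j][0], j) for j in benefit[i]),
                            reverse=True)
            (v, best), (w, _) = ranked[0], ranked[1]
            bid = seen[i][best][0] + v - w + eps
            if best not in top or bid > top[best][0]:
                top[best] = (bid, i)
        for j, (bid, i) in top.items():
            if bid >= prices[j] + eps:            # accept only bids that raise the price enough
                if owner[j] is not None:
                    holds[owner[j]] = None
                prices[j], owner[j], holds[i] = bid, i, j
        t += 1
    return holds, prices, t
```

Bids computed from stale prices may fail the acceptance test against the current price, in which case the person stays unassigned and bids again later. Varying `seed` and `max_delay` changes how many steps are needed, but not the quality of the final assignment.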

The following proposition establishes the validity of the asynchronous AUCTION
algorithm of this section.

Proposition 1: Let Assumptions 1 and 2 hold and assume that there exists at least one
complete assignment. Then for all $t$ and all $j$ for which $r_j(t) \ne 0$, the pair
$[p_j(t), r_j(t)]$ satisfies the $\epsilon$-CS condition

$$\max_{k \in A(i)} (a_{ik} - p_k(t) - \epsilon) \le a_{ij} - p_j(t) \quad \text{if } i = r_j(t) \tag{A-16}$$

Furthermore, there is a finite time at which the algorithm terminates. The
complete assignment obtained upon termination is within $n\epsilon$ of being
optimal, and is optimal if $\epsilon < 1/n$ and the benefits $a_{ij}$ are integer.

Proof: Let $[p_j(t), r_j(t)]$ be a pair with $r_j(t) \ne 0$. To simplify notation, let $i = r_j(t)$. We first
consider times $t$ at which $p_j$ was just updated, i.e., $p_j(t) > p_j(t-1)$ and $i \ne r_j(t-1)$, and person $i$
submitted a highest bid for object $j$ at time $t-1$. Then we have by construction

$$\begin{aligned} a_{ij} - p_j(t) &= a_{ij} - b_i(t-1) = a_{ij} - p_j(t_{ij}(t-1)) - v_i(t-1) + w_i(t-1) - \epsilon \\ &= w_i(t-1) - \epsilon \ge \max_{k \in A(i),\ k \ne j} (a_{ik} - p_k(t)) - \epsilon \end{aligned}$$

where the last inequality follows using the fact $p_k(t) \ge p_k(t')$ for all $k$ and $t$, $t'$ with $t \ge t'$
(the corresponding inequality for $k = j$ holds trivially). Therefore, the $\epsilon$-CS condition (A-16)
holds for all $t$ at which $p_j$ was just updated.
Next we consider times $t$ for which $p_j$ was not just updated. Let $t'$ be the largest time
which is less than $t$ and for which $p_j(t') > p_j(t'-1)$; this is the largest time prior to $t$ that
object $j$ was assigned to person $i$. By the preceding argument, we have

$$a_{ij} - p_j(t') \ge \max_{k \in A(i)} (a_{ik} - p_k(t')) - \epsilon$$

and since $p_j(t') = p_j(t)$, and $p_k(t) \ge p_k(t')$ for all $k$, the $\epsilon$-CS condition (A-16) again follows.

We next show that the algorithm terminates in finite time. We first note the following:

a. Once an object is assigned, it remains assigned for the remainder of the algorithm.
Furthermore, an unassigned object has a price equal to its initial price. Using Eqs.
(A-8) and (A-10), we have $w_i(t) \le v_i(t)$, so from Eq. (A-11) we see that
$b_i(t) \ge p_{j_i(t)}(t_{i j_i(t)}(t)) + \epsilon$. It follows from Eq. (A-13) that if person $i$ bids for
object $j$ at time $t$, we must have

$$b_j(t) \ge p_j(t_{ij}(t)) + \epsilon \tag{A-17}$$

b. Each time an object $j$ receives a bid $b_j(t)$ at time $t$, there are two possibilities: either
$b_j(t) < p_j(t) + \epsilon$, in which case $p_j(t+1) = p_j(t)$, or else $b_j(t) \ge p_j(t) + \epsilon$, in which case
$p_j(t+1) \ge p_j(t) + \epsilon$ and $p_j(t)$ increases by at least $\epsilon$ [cf. Eq. (A-15)]. In the latter case
we call the bid substantive. Suppose that an object receives an infinite number of
bids during the algorithm. Then, an infinite subset of these bids must be
substantive; otherwise $p_j(t)$ would stay constant for $t$ sufficiently large, we would
have $p_j(t_{ij}(t)) = p_j(t)$ for $t$ sufficiently large because old price information is
eventually purged from the system (cf. Assumption 2), and in view of Eqs. (A-15)
and (A-17) we would have $p_j(t+1) \ge p_j(t) + \epsilon$ for all times $t$ at which $j$ receives a
bid, arriving at a contradiction.

Assume now, in order to obtain a contradiction, that the algorithm does not terminate
finitely. Then, because of Assumption 1, there is an infinite number of times $t$ at which $R(t)$ is
nonempty and at each of these times, at least one object receives a bid. Thus, there is a
nonempty subset of objects $J^\infty$ which receive an infinite number of bids, and a nonempty
subset of persons $I^\infty$ which submit an infinite number of bids. In view of (b) above, the prices
of all objects in $J^\infty$ increase to $\infty$, and in view of (a) above, all objects in $J^\infty$ are assigned for $t$
sufficiently large. Furthermore, the prices of all objects $j \notin J^\infty$ stay constant for $t$ sufficiently
large and since old information is purged from the system (cf. Assumption 2), we also have
$p_j(t_{ij}(t)) = p_j(t)$ for all $i$, $j \notin J^\infty$, and $t$ sufficiently large. These facts imply that for sufficiently
large $t$, every object $j \in A(i)$ which is not in $J^\infty$ would be preferable for person $i$ to every object
$j \in A(i) \cap J^\infty$. Since the $\epsilon$-CS condition (A-16) holds throughout the algorithm, we see that for
each person $i \in I^\infty$ we must have $A(i) \subset J^\infty$; otherwise such a person would bid for an object
not in $J^\infty$ for sufficiently large $t$.

We now note that for sufficiently large $t$, the only bids taking place will be by persons
in $I^\infty$ bidding for objects in $J^\infty$, so each object in $J^\infty$ will be assigned to some person from $I^\infty$,
while at least one person in $I^\infty$ will be unassigned (otherwise the algorithm would terminate).
We conclude that the number of persons in $I^\infty$ is larger than the number of objects in $J^\infty$. This,
together with the earlier shown fact $A(i) \subset J^\infty$ for all $i \in I^\infty$, implies that there is no complete
assignment, contradicting our assumptions.

The optimality properties of the assignment obtained upon termination follow from the
$\epsilon$-CS property shown and our earlier discussion on the synchronous version of the algorithm.
q.e.d.

